Assignment 2: Factor Analysis

Begin working with this when you have completed the Hands-On Exercise 2.


Your task:

Practice with 2-3 sets of variables (one at a time) in the USHS data, trying different numbers of factors (2-3), comparing orthogonal and oblique rotations, and hence going through ALL the phases required to do FA (as instructed and shown in the Hands-On Exercise 2).

It should be possible to analyse these particular sets of variables in the USHS data with FA (some of them might be easier to interpret than others):

k21_1:k21_9, # various things in life (obs: originally wrong scale! "3" -> NA!), 9 variables
k22_1:k22_10, # how often (felt various things), 10 variables
k23:k34, # General Health Questionnaire (1-4), 12 variables
k46_1:k46_11, # on how many days/week eat various food/(11)skip lunch (0,...,7=every day), 11 variables
k95_1:k95_18, # feelings of studies (1:disagree,6:agree), 18 variables

OBS: If you use the k21* variables, remember to wrangle them first ("3" -> NA!). See Hands-On Exercise 1b!

Note: With FA, the variables can have different directions (like the items 2 & 3 in k22*), even different scales.
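FA itself copes with mixed item directions (a reversed item simply gets a negative loading), but if you want all items to point the same way for easier interpretation, a small helper can reverse them. A minimal sketch, where `reverse_item` is my own hypothetical helper and the scale endpoints lo/hi are assumptions to be checked against the questionnaire:

```r
# Hedged sketch: reverse a negatively-worded item.
# On a scale running from lo to hi, the reversed score is lo + hi - x;
# NAs pass through unchanged.
reverse_item <- function(x, lo, hi) {
  lo + hi - x
}

# e.g. items 2 and 3 of k22* on an assumed 1-5 scale:
reverse_item(c(1, 2, 5, NA), lo = 1, hi = 5)
## [1]  5  4  1 NA
```

With dplyr this could be applied as `mutate(across(c(k22_2, k22_3), ~ reverse_item(.x, 1, 5)))`, mirroring the k21 recoding used later in this document.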

PERHAPS you could also try to combine variables across different measures: some items of k21 together with some items from k95, for instance! (I have not tried to analyse those kinds of combinations, but it should be perfectly possible, at least if the items are somehow substantively related to each other. TRY IT!!)

Assignment 2 also requires your interpretations (you don't have to be a subject-matter expert; I just encourage you to practice, as these phenomena should be easy enough to understand without special expertise). Interpreting your analyses is very important, and this course is a safe place to practice that part, too.

So TRY to interpret the factors you find and give them descriptive NAMES.

Finally, show that you can work with the new factor score variables in the original data: visualise those scores with histograms, box plots and scatter plots, also using some other variables of the USHS. You will find plenty of good R code for producing such visualisations in the earlier exercises.


Analysing the USHS data with Factor Analysis (step by step)

Use as many R code chunks and text areas as you need - it’s now all up to you!

Please choose only your BEST FACTOR ANALYSIS and show it here step by step!

Start by reading in the data and converting the factors (gender and age), just like in the exercise.

Environment and data preparation

1. Package loading

library(tidyverse)   # a collection of useful packages
library(dplyr)       # data wrangling
library(psych)       # factor analysis
library(patchwork)   # picture layout
library(finalfit)    # well-tabulated descriptive and inferential statistics
library(naniar)      # visualising missing data (install.packages("naniar") if needed)
library(broom)       # tidy tables
library(ggplot2)     # plotting graphs
library(tidyr)       # pivoting data to longer or wider format
library(GPArotation) # matrix rotation

2. Data wrangling

2.1 Load USHS data

#read raw data
USHS  <-  read.csv("daF3224e.csv", sep = ";")
#select reduced set of variables into a new data set
USHSpart  <-  USHS  %>% 
  select(
    fsd_id,
    k1, # age
    k2, # gender (1=male, 2=female, 3=not known/unspecified), NA (missing)
    bv1, # university (from register), see label info! (40 different codes)
    bv3, # field of study (from register), various codes
    bv6, # higher educ institution (0=uni applied sciences, 1=university)
    k7, # learning disability
    k8_1:k8_4, # current well-being
    k9_1:k9_30, # symptoms (past month), skip 31 ("other") due to missingness
    k11a, k11b, # height (cm) a=male, b=female
    k12a, k12b, # weight (kg) a=male, b=female
    k13, # what do you think of your weight? (1=under, 5=over)
    k14:k20, # various questions (mostly yes/no)
    k21_1:k21_9, # various things in life (obs: originally wrong scale! "3" -> NA!)
    k22_1:k22_10, # how often (felt various things) (obs: 2 & 3 should be inversed! see quest.)
    k23:k34, # General Health Questionnaire (1-4, directions already OK)
    k95_1:k95_18) # (end of select!)

2.2 Variable recreation and generation

  In this section, I first generated height, weight, age and gender variables. Since height and weight are best interpreted relative to each other, I then generated a Body Mass Index (BMI) variable that merges the pair using the well-accepted formula: \[ BMI=\frac{Weight (kg)}{Height (m)^{2}} \] Next, I generated a new categorical variable by slicing BMI into underweight (<18.5), normal weight (18.5-25), overweight (25-30) and obese (>30). Previous studies have found that BMI is a good gauge of the risk of diseases associated with excess body fat: the higher our BMI, the higher our risk of conditions such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers. As such, it would be interesting to explore its relation to Finnish students' general health, dietary status and feelings about their studies.
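A quick sanity check of the formula, using an illustrative 180 cm / 80 kg pair of values (not taken from the analysis itself):

```r
# BMI = weight (kg) / height (m)^2; height is recorded in cm, hence /100
bmi <- 80 / (180 / 100)^2
round(bmi, 2)
## [1] 24.69
```

24.69 falls in the (18.5, 25] band, i.e. "Normal weight".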

# Recreate height, weight, age and gender variables
USHSv2  <-  USHSpart  %>% 
  mutate(height = if_else(is.na(k11a), k11b, k11a), #merge height variables of two sexes into one
         weight = if_else(is.na(k12a), k12b, k12a), #merge weight variables of two sexes into one
         gender = k2  %>%  factor()  %>%  fct_recode("Male" = "1",
                                                 "Female" = "2",
                                                 "Not known" = "3"), #add variable name and level label to gender
         age = k1  %>%  
           factor()  %>%  
           fct_recode("19-21" = "1", "22-24" = "2", "25-27" = "3", 
                      "28-30" = "4", "31-33" = "5", "33+" = "6"))#add variable name and level label to age

# Generate BMI(body mass index) and BMI.factor(4 categories)
USHSv2 <- USHSv2 %>% 
  mutate(BMI = weight/(height/100)^2) # adopting the formula of BMI

USHSv2 <- USHSv2 %>% 
  mutate(BMI.factor = BMI %>% #turn BMI into a factor and add labels (underweight: <18.5; normal weight: 18.5-25; overweight: 25-30; obese: >30)
                    cut(breaks=c(1,18.5,25,30,100), include.lowest = T) %>% 
                    fct_recode("Underweight" = "[1,18.5]", 
                               "Normal weight" = "(18.5,25]",
                               "Overweight" = "(25,30]",
                               "Obese" = "(30,100]") %>% 
                    ff_label("BMI ranges"))

USHSv2[, c(42:45)] <- list(NULL) # remove the raw k11a~k12b columns (42:45); height, weight and gender were already recreated above

USHSv2 %>% 
  select(height, weight, age, gender, BMI, BMI.factor) %>% 
  head %>%  
  knitr::kable() # examine the data
height weight age gender BMI BMI.factor
180 80 NA Male 24.69136 Normal weight
172 55 22-24 Female 18.59113 Normal weight
186 94 25-27 Male 27.17077 Overweight
183 85 22-24 Male 25.38147 Overweight
185 66 22-24 Male 19.28415 Normal weight
164 51 NA Female 18.96193 Normal weight
unique(USHSv2$k21_1)
## [1] -2  0 -1  2  1  3 NA

2.3 Data cleansing

  In item k21 (k21_1 to k21_9), the choice “difficult to say” was assigned the value 3, which could be misleading: on these ordinal variables, a 3 would be read as a specific strength rather than a non-answer. In this section, this was corrected by converting 3 to NA.

#inspect the unique values in k21_1 through k21_9
USHSv2 %>% 
  select(starts_with(("k21"))) %>% apply(.,2, function(x)unique(x))%>% knitr::kable()
k21_1 k21_2 k21_3 k21_4 k21_5 k21_6 k21_7 k21_8 k21_9
-2 -1 0 1 1 1 -1 0 0
0 0 1 0 0 0 0 -1 -1
-1 1 -1 3 2 2 -2 1 1
2 2 2 2 -2 -1 1 2 2
1 -2 -2 -2 -1 NA 2 -2 3
3 3 3 -1 3 -2 3 3 -2
NA NA NA NA NA 3 NA NA NA

  Finding: the value 3 (which I want to convert to NA) is currently present in the data.

#Convert the value 3 in k21_1 to k21_9 to NA
USHSv2 <- USHSv2 %>% 
  mutate(across(starts_with("k21"), ~replace(., . == 3, NA)))

#inspect the unique values in k21_1 through k21_9 in the updated data set
#inspect if all 3s convert to NA
USHSv2 %>% select(starts_with(("k21"))) %>% 
  apply(.,2, function(x)unique(x)) %>% knitr::kable()
k21_1 k21_2 k21_3 k21_4 k21_5 k21_6 k21_7 k21_8 k21_9
-2 -1 0 1 1 1 -1 0 0
0 0 1 0 0 0 0 -1 -1
-1 1 -1 NA 2 2 -2 1 1
2 2 2 2 -2 -1 1 2 2
1 -2 -2 -2 -1 NA 2 -2 NA
NA NA NA -1 NA -2 NA NA -2

  Finding: all 3s were successfully replaced with NA.

2.4 Data inspection

  In this section, missing values of all the variables were inspected to find out if there was a pattern.

#inspect the  NAs in each column
#To get a clear view, separate into 4 pictures
ncol(USHSv2)#get the number of columns
## [1] 104
USHSv2 %>%  select(1:25) %>% vis_miss() 

USHSv2 %>%  select(26:50) %>% vis_miss()

USHSv2 %>%  select(51:75) %>% vis_miss()

USHSv2 %>%  select(76:104) %>% vis_miss() 

  Findings: variables k1, k7 and k9_1~k9_30 suffered from quite a number of NAs, calling for a quantitative inspection. Besides, several horizontal black lines (around 9) appeared in the graphs, suggesting that those respondents tended to give more consecutive NAs than others.

#a closer look at the identified high-NA variables
USHSv2 %>%  select(k1,k7,k9_1:k9_30) %>% vis_miss()

#calculate the percentage of NAs for these variables
USHSv2 %>% select(k1, k7, k9_1:k9_30) %>% 
  apply(.,2,function(x)sum(is.na(x))/nrow(.)) %>% 
  tidy
## # A tibble: 32 x 2
##    names      x
##    <chr>  <dbl>
##  1 k1    0.0669
##  2 k7    0.0408
##  3 k9_1  0.0312
##  4 k9_2  0.0936
##  5 k9_3  0.0305
##  6 k9_4  0.105 
##  7 k9_5  0.0434
##  8 k9_6  0.0720
##  9 k9_7  0.0920
## 10 k9_8  0.0646
## # … with 22 more rows

  Findings: k1 (6.7% NAs), k7 (4.1%), k9_1 (3.1%), k9_3 (3.0%), k9_5 (4.3%), k9_8 (6.5%), k9_10 (5.0%), k9_13 (5.1%), k9_24 (6.3%), k9_25 (6.5%) and k9_26 (5.9%) were actually acceptable in terms of NA proportion. The other k9 variables had NA proportions above 7%, some reaching 12%, suggesting possible bias such as a floor effect. Analyses of k9 should therefore be made with extreme care.

2.5 Variable selection

  In this section, the general health variables k23~k34 were selected into a new data set USHSghq for factor analysis #1 (general health). Variables k95_1~k95_18 were selected into a new data set USHSstu for factor analysis #2 (study burnout).

#k23~k34 were selected into a new data set USHSghq for factor analysis #1
USHSghq <- USHSv2 %>% select(k23:k34)
#inspect the unique value of USHSghq to tell if data wrangling is needed
USHSghq%>% apply(2, function(x)unique(x))
##      k23 k24 k25 k26 k27 k28 k29 k30 k31 k32 k33 k34
## [1,]   2   2   2   2   3   2   2   2   1   1   1   2
## [2,]   3   1   3   1   2   1   4   1   2   2   2   3
## [3,]   4   3   1   3   1   3   3   3   3   3   3   1
## [4,]  NA   4   4  NA   4   4   1  NA   4   4   4   4
## [5,]   1  NA  NA   4  NA  NA  NA   4  NA  NA  NA  NA
#k95_1~k95_18 were selected into a new data set USHSstu for factor analysis #2
USHSstu <- USHSv2 %>% select(k95_1:k95_18)#USHSstu means USHS study burnout
#inspect the unique value of USHSstu to tell if data wrangling is needed
USHSstu %>% apply(2, function(x)unique(x))
##      k95_1 k95_2 k95_3 k95_4 k95_5 k95_6 k95_7 k95_8 k95_9 k95_10 k95_11 k95_12
## [1,]     1     1     1     1     1     1     1     1     1      3      3      3
## [2,]     4     2     2     4     2     2     3     2     3      4      4      2
## [3,]     2     5    NA     2     4     6     4     4     2      2      5      4
## [4,]     3     3     3    NA     5     3     2     3    NA     NA      2      5
## [5,]     5     4     4     3    NA     4    NA     6     4      1      6      1
## [6,]    NA    NA     5     6     3    NA     6    NA     5      5     NA     NA
## [7,]     6     6     6     5     6     5     5     5     6      6      1      6
##      k95_13 k95_14 k95_15 k95_16 k95_17 k95_18
## [1,]      1      3      1      2      3      1
## [2,]      4      5      4      4      4      5
## [3,]      2      1      3      3      1      2
## [4,]      3      4      2      5      2      4
## [5,]      5      2      5      6     NA      3
## [6,]     NA      6     NA      1      5     NA
## [7,]      6     NA      6     NA      6      6

Analysis 1: Factor analysis of general health questionnaire

1. Analysis

1.1 Descriptive statistics for variables

  In this section, descriptive statistics were computed for each variable of USHSghq.

#inspect the unique values in k23~k34
USHSghq %>% 
  apply(., 2, function(x)unique(x))%>%
  knitr::kable()
k23 k24 k25 k26 k27 k28 k29 k30 k31 k32 k33 k34
2 2 2 2 3 2 2 2 1 1 1 2
3 1 3 1 2 1 4 1 2 2 2 3
4 3 1 3 1 3 3 3 3 3 3 1
NA 4 4 NA 4 4 1 NA 4 4 4 4
1 NA NA 4 NA NA NA 4 NA NA NA NA

  Finding: all variables contain the values 1, 2, 3, 4 and NA.

#visualizing the number of values for each item
longv1 <- USHSv2 %>%  #There are two patterns of choices in k23:k34. Select 
                      #items of pattern one and convert them to long format
  select(k24, k27, k28, k31, k32, k33) %>%  
  pivot_longer(everything(), names_to = "item", values_to = "score") 
#plot bar charts for the number of each pattern one choice by each item
p1 <- longv1 %>%   
  ggplot (aes(x= factor(score), fill = score))+
  geom_bar() +
  facet_wrap(~item)+
  theme_minimal()+
  xlab("1=not at all; 2=no more than usual;    
       3=rather more than usual; 4=much more than usual")+ #choices of pattern 1
  theme(legend.position ="none")
# keep visualising other items
longv2 <- USHSv2 %>% #There are two patterns of choices in k23:k34. Select 
                      #items of pattern two and convert them to long format
  select(k23, k25, k26, k29, k30, k34) %>% 
  pivot_longer(everything(), names_to = "item", values_to = "score") 
#plot bar charts for the number of each pattern two choice by each item
p2 <- longv2 %>%  
  ggplot (aes(x= factor(score), fill = score))+
  geom_bar() +
  facet_wrap(~item)+
  theme_minimal()+
  xlab("1=more so than usual; 2=same as usual;  
       3=less so than usual; 4=much less than usual")+ #choices of pattern 2
  theme(legend.position ="none")
#display the charts in one graph
p1/p2

  Findings: “Not at all” and “No more than usual” were the most frequent choices for items k24, k27, k28, k31, k32 and k33, indicating that most respondents were in a stable and healthy state. As Finland is a developed country, this is no surprise.

  “Same as usual” was the most frequent choice for items k23, k25, k26, k29, k30 and k34, again indicating that most respondents were in a stable and healthy state, consistent with the findings for the items above.

#Examine if missing values are dependent on any demographic variables
longv3 <- USHSv2 %>%  # convert health-related variables to long format
  select (fsd_id, gender, BMI.factor, k23:k34) %>% 
  pivot_longer(k23:k34, names_to = "item", values_to = "score") %>% 
  filter(is.na(score)) 
# plot stacked bar chart of the percentage of NAs by gender for each item
p1 <- longv3 %>%  
  ggplot(aes(x = item, fill = gender))+
  geom_bar(position = position_fill(reverse = T))+
  theme(legend.position = "none", axis.text = element_text(size = 6), 
        plot.title = element_text(vjust = 1), 
        axis.title.y = element_text(size = 9))+
  labs(title = "Percentage of NAs by gender", 
       y = "Percentage of NAs", x ="", subtitle = "uncorrected")
# plot stacked bar chart of the number of NAs by gender for each item
p2 <- longv3 %>% 
  ggplot(aes(x = item, fill = gender))+
  geom_bar(position = position_stack(reverse = T))+
  theme(legend.position = "none", axis.text = element_text(size = 6), 
        axis.title.y = element_text(size = 9))+
  labs(title = "Number of NAs by gender", 
       y = "Number of NAs", x = "Item", subtitle = "uncorrected")
#create contingency table for item&gender
Igender <- longv3 %>% count(item, gender)
#correct for effect of sample size by dividing male 
#NA number with the sample size of male(n = 1068),  
#and female NA number with that of female(n = 2022)
Igender_cor <- Igender %>%  
  mutate (score = case_when(gender == "Male" ~ n/1068,
                            gender == "Female" ~ n/2022,
                            is.na(gender) ~ n/20)) %>% filter (!is.na(gender))

#plot the corrected data
p3 <- Igender_cor %>% 
  ggplot(aes(x = item, y = score, fill = gender))+
  geom_col(position = position_fill())+
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text (angle = 45))+
  labs(title = "Percent of NAs by gender", 
       subtitle ="corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "Gender"))
#display the uncorrected (left) and the corrected (right)
p1/p2|p3

  Findings: based on the raw data, far more NA answers came from females than from males (see the pictures on the left). However, we should take into account that the female respondents were around twice as numerous as the males. Hence, the counts were corrected by dividing the number of NAs in each subgroup by that subgroup's sample size (male and female), giving the graph corrected for sample size (see the picture on the right). It shows that although the proportion of females giving NA answers is still larger than that of males, the difference becomes much less pronounced. Still, attention should be given to items k23 (“Have you recently been able to concentrate on your tasks?”) and k25 (“Have you recently felt that you are playing a useful part in things?”), for which around 80% of the NAs came from females, as well as item k27 (“Have you recently felt constantly under strain?”), for which around 70% of the NAs came from males, indicating that bias such as a floor effect might exist in these items.

#stacked bar chart of the percentage of NAs by BMI range for each item
p4 <- longv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
  geom_bar(position = position_fill(reverse = T))+
  ylab("Percent of NAs")+
  theme(legend.position = "none", axis.text = element_text(size = 6), 
        axis.title.y = element_text (size = 9), 
        plot.title = element_text (vjust = 1))+
  labs(title = "Percent of NAs by BMI", x= "", subtitle = "uncorrected")
#stacked bar chart of the number of NAs by BMI range for each item
p5 <- longv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
  geom_bar(position = position_stack(reverse = T))+
  ylab("Number of NAs")+
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text (size = 9))+
  labs(title = "Number of NAs by BMI", x = "Item", subtitle = "uncorrected")
#corrected bar chart of percent of NAs by BMI ranges for each item
#create contingency table for item & BMI
longv3 %>% count(BMI.factor)
## # A tibble: 5 x 2
##   BMI.factor        n
##   <fct>         <int>
## 1 Underweight       1
## 2 Normal weight   118
## 3 Overweight       39
## 4 Obese            25
## 5 <NA>             31
IBMI <- longv3 %>% count(item, BMI.factor)
IBMI <- IBMI %>% rename(score = n) #change the new data frame's variable name
#correct for effect of sample size by dividing number of normal weight respondents
# with the sample size of it(n = 118). Apply the same idea to other levels.
IBMI_cor <- IBMI %>%  
 mutate (score_corrected = case_when(BMI.factor == "Normal weight" ~ score/118,
                                             BMI.factor == "Overweight" ~ score/39,
                                             BMI.factor == "Obese" ~ score/25,
                                             BMI.factor == "Underweight" ~ score/1,
                                             is.na(BMI.factor) ~ score/31))

#plot the corrected bar chart
p6 <- IBMI_cor %>% 
  ggplot(aes(x = item, y = score_corrected, fill = BMI.factor))+
  geom_col(position = position_fill(reverse = T))+
  theme(axis.text = element_text(size = 5), 
        axis.text.x = element_text(angle = 45))+
  labs(title = "Percent of NAs by BMI", 
       subtitle ="corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "BMI"))
#display the charts
p4/p5|p6

  Respondents with normal weight gave most NA answers (see pictures on the left). After sample size correction (see picture on right), the NA answers from different BMI groups became more equally distributed across each subcategory. However, the observation was inconclusive due to a large proportion lacking BMI information (NAs).

#descriptive statistics for general health questions. 
describe(USHSghq, IQR = T)
##     vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## k23    1 3093 2.30 0.64      2    2.27 0.00   1   4     3 0.58     0.49 0.01
## k24    2 3092 1.89 0.84      2    1.82 1.48   1   4     3 0.60    -0.40 0.02
## k25    3 3091 2.12 0.60      2    2.09 0.00   1   4     3 0.93     2.18 0.01
## k26    4 3091 2.11 0.52      2    2.07 0.00   1   4     3 0.99     3.02 0.01
## k27    5 3094 2.36 0.87      2    2.33 1.48   1   4     3 0.09    -0.69 0.02
## k28    6 3089 1.85 0.83      2    1.77 1.48   1   4     3 0.64    -0.36 0.01
## k29    7 3092 2.18 0.61      2    2.17 0.00   1   4     3 0.71     1.22 0.01
## k30    8 3091 2.11 0.52      2    2.07 0.00   1   4     3 0.91     2.80 0.01
## k31    9 3087 1.94 0.89      2    1.87 1.48   1   4     3 0.52    -0.72 0.02
## k32   10 3092 1.84 0.87      2    1.75 1.48   1   4     3 0.68    -0.47 0.02
## k33   11 3096 1.65 0.85      1    1.52 0.00   1   4     3 1.07     0.14 0.02
## k34   12 3098 2.09 0.66      2    2.07 0.00   1   4     3 0.56     0.89 0.01
##     IQR
## k23   1
## k24   1
## k25   0
## k26   0
## k27   1
## k28   1
## k29   0
## k30   0
## k31   2
## k32   1
## k33   1
## k34   0

  Findings: the means and medians were very close to each other. The standard deviations were more varied, indicating that some items had a different response pattern than others.

# convert health-related variables to long format
longv4<- USHSghq %>%  
  pivot_longer(everything(), names_to = "Item", values_to = "Score")
# draw histogram of the scores for each item
longv4 %>% ggplot(aes(x = Score, fill = Score)) +
  geom_histogram(binwidth=0.9, fill = "cornsilk3") +
  facet_wrap(~Item)+
  labs(caption="Score is each respondent's choice for each item")+
  theme_bw()

  Findings: the assumption of multivariate normality was severely violated (at least half of the variables were not normal; see the histograms above). This indicated that maximum likelihood factoring should be excluded due to its poor validity with non-normal data.

1.2 Factorability of the items

  Next, the factorability of the items was examined, using several well-recognized criteria for the factorability of a correlation matrix.

#correlation to examine the factorability
c_matrix <- cor.plot(USHSghq, method = "spearman")

  It was observed that all of the 12 items correlated at least 0.4 with at least one other item, suggesting good factorability.

#perform KMO test for factorability
KMO(c_matrix) 
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = c_matrix)
## Overall MSA =  0.93
## MSA for each item = 
##  k23  k24  k25  k26  k27  k28  k29  k30  k31  k32  k33  k34 
## 0.95 0.95 0.94 0.93 0.93 0.95 0.93 0.95 0.93 0.91 0.90 0.93

  The Kaiser-Meyer-Olkin measure of sampling adequacy was .93, indicating marvelous adequacy according to Kaiser.

#perform bartlett test for factorability
cortest.bartlett(c_matrix, nrow(USHSghq))
## $chisq
## [1] 15570.3
## 
## $p.value
## [1] 0
## 
## $df
## [1] 66

  Bartlett’s test of sphericity was significant (χ2 (66) = 15570.3, p < .001), so the null hypothesis that the correlation matrix equals an identity matrix was rejected, again indicating good factorability.

1.3 Estimating factor structure

  Next, the number of factors was estimated. Several well-recognized criteria for exploring factor structure solution were used.

# Parallel analysis and scree plot
fa.parallel(USHSghq, fa = "fa", fm = "ml", plot = T) 

## Parallel analysis suggests that the number of factors =  4  and the number of components =  NA
# very simple structure and BIC
nfactors(c_matrix, n=3, rotate = "oblimin", fm = "pa", n.obs=12)

## 
## Number of factors
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
##     n.obs = n.obs, plot = FALSE, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.88  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.66  with  2  factors
## The Velicer MAP achieves a minimum of 0.02  with  1  factors 
## Empirical BIC achieves a minimum of  -129.62  with  1  factors
## Sample Size adjusted BIC achieves a minimum of  18.23  with  3  factors
## 
## Statistics by number of factors 
##   vss1 vss2   map dof chisq prob sqresid  fit RMSEA  BIC SABIC complex eChisq
## 1 0.88 0.00 0.020  54  2.89    1     4.5 0.88     0 -131    32     1.0   4.57
## 2 0.56 0.66 0.025  43  1.11    1    12.2 0.66     0 -106    24     1.1   1.70
## 3 0.47 0.58 0.035  33  0.45    1    13.2 0.64     0  -82    18     1.3   0.64
##    SRMR eCRMS eBIC
## 1 0.054 0.059 -130
## 2 0.033 0.041 -105
## 3 0.020 0.028  -81

  The parallel analysis and scree plot (four factors lying above the eigenvalues of the simulated data), based on the correlation matrix, suggested a 4-factor structure, while the VSS was favourable to a uni-dimensional interpretation: VSS complexity 1, the Velicer MAP and the empirical BIC all pointed to a single factor, whereas the sample-size-adjusted BIC (SABIC) reached its minimum at three factors.

1.4 Presetting parameters

I loaded the variables on 4 factors based on the result of the parallel analysis. A 3-factor solution was also tested for exercise purposes, although no statistical criterion supported it. Oblimin rotation was adopted because dimensions of health are usually highly correlated, and the Hands-On Exercise, which examined a 2-factor solution, found a correlation of 0.75 between the factors. I assumed that some latent construct called “general health” influences how people answer these questions, and hence principal axis factoring (PAF) was adopted instead of principal component analysis (PCA; besides, PCA is orthogonal by nature). Since the assumption of multivariate normality was severely violated, I also did not adopt maximum likelihood factoring.

1.5 Factor analysis

#Examining the 4-factor solution
fa4 <- fa(USHSghq, nfactors = 4, rotate = "oblimin", fm = "pa", scores = "regression")
fa4
## Factor Analysis using method =  pa
## Call: fa(r = USHSghq, nfactors = 4, rotate = "oblimin", scores = "regression", 
##     fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       PA1   PA2   PA4   PA3   h2   u2 com
## k23 -0.04  0.46  0.09  0.21 0.40 0.60 1.5
## k24  0.21  0.05  0.02  0.42 0.37 0.63 1.5
## k25  0.08  0.56  0.09 -0.04 0.43 0.57 1.1
## k26  0.03  0.77 -0.03 -0.03 0.58 0.42 1.0
## k27  0.05  0.02  0.01  0.62 0.44 0.56 1.0
## k28  0.43  0.11  0.03  0.30 0.54 0.46 2.0
## k29 -0.09  0.16  0.49  0.24 0.50 0.50 1.8
## k30  0.03  0.37  0.27  0.08 0.44 0.56 1.9
## k31  0.54 -0.06  0.27  0.16 0.65 0.35 1.7
## k32  0.75  0.07 -0.03  0.06 0.65 0.35 1.0
## k33  0.85  0.03  0.02 -0.05 0.72 0.28 1.0
## k34  0.09  0.04  0.75 -0.04 0.68 0.32 1.0
## 
##                        PA1  PA2  PA4  PA3
## SS loadings           2.21 1.67 1.39 1.13
## Proportion Var        0.18 0.14 0.12 0.09
## Cumulative Var        0.18 0.32 0.44 0.53
## Proportion Explained  0.35 0.26 0.22 0.18
## Cumulative Proportion 0.35 0.61 0.82 1.00
## 
##  With factor correlations of 
##      PA1  PA2  PA4  PA3
## PA1 1.00 0.61 0.65 0.54
## PA2 0.61 1.00 0.68 0.48
## PA4 0.65 0.68 1.00 0.55
## PA3 0.54 0.48 0.55 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  66  and the objective function was  5.02 with Chi Square of  15570.3
## The degrees of freedom for the model are 24  and the objective function was  0.04 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  3082 with the empirical chi square  50.89  with prob <  0.0011 
## The total number of observations was  3110  with Likelihood Chi Square =  124.11  with prob <  1.8e-15 
## 
## Tucker Lewis Index of factoring reliability =  0.982
## RMSEA index =  0.037  and the 90 % confidence intervals are  0.03 0.043
## BIC =  -68.91
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1  PA2  PA4  PA3
## Correlation of (regression) scores with factors   0.93 0.89 0.89 0.82
## Multiple R square of scores with factors          0.87 0.79 0.80 0.67
## Minimum correlation of possible factor scores     0.74 0.57 0.59 0.34
# draw a diagram of the factor model (measurement model):
fa.diagram(fa4, cut = 0, digits =2, main = "Factor Analysis, Oblimin rotation")

#print the results of analysis
print(fa4$loadings,cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##     PA1    PA2    PA4    PA3   
## k31  0.539                     
## k32  0.752                     
## k33  0.848                     
## k25         0.559              
## k26         0.772              
## k34                0.754       
## k27                       0.616
## k23         0.462              
## k24                       0.420
## k28  0.425                     
## k29                0.486       
## k30         0.373              
## 
##                  PA1   PA2   PA4   PA3
## SS loadings    1.830 1.314 0.974 0.784
## Proportion Var 0.152 0.110 0.081 0.065
## Cumulative Var 0.152 0.262 0.343 0.408
#proportion of the total variance associated with each eigenvalue
prop.table(fa4$values) 
##  [1]  0.813965250  0.090210809  0.057374453  0.038604906  0.012856409
##  [6]  0.005442583  0.004580568  0.001380270 -0.001539784 -0.005620033
## [11] -0.007304575 -0.009950854
#Examining the 3-factor solution
fa3 <- fa(USHSghq, nfactors = 3, rotate = "oblimin", fm = "pa", scores = "regression")

# draw a diagram of the factor model (measurement model):
fa.diagram(fa3, cut = 0, digits =2, main = "Factor Analysis, Oblimin rotation")

#print the results of analysis
print(fa3$loadings,cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##     PA2    PA1    PA3   
## k23  0.576              
## k25  0.674              
## k26  0.764              
## k29  0.504              
## k30  0.592              
## k31         0.610       
## k32         0.786       
## k33         0.900       
## k27                0.520
## k24                0.379
## k28         0.467       
## k34  0.482              
## 
##                  PA2   PA1   PA3
## SS loadings    2.241 2.146 0.678
## Proportion Var 0.187 0.179 0.056
## Cumulative Var 0.187 0.366 0.422
prop.table(fa3$values)
##  [1]  0.851802491  0.091916445  0.056417814  0.026925354  0.010959237
##  [6]  0.004800415  0.003251587 -0.002007687 -0.006115253 -0.008716340
## [11] -0.011895962 -0.017338100

  Solutions for three and four factors were each examined using oblimin rotations of the factor loading matrix. Correlation coefficients among the factors in both solutions ranged from 0.48 to 0.68, supporting the appropriateness of oblimin rotation. Although the primary loadings were sufficient and roughly equal in number for both solutions, the four-factor solution, which reached a cumulative explained variance of 53%, was preferred because: (a) the Hands-On Exercise found that a 2-factor solution did not provide sufficient granularity for interpreting the structure; (b) the eigenvalues on the scree plot 'level off' after four factors; and (c) the 3-factor structure had an insufficient number of primary loadings and was difficult to interpret from a medical perspective.
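Since the assignment also asks for a comparison of orthogonal and oblique rotations, the same four-factor model could be refitted with varimax. A hedged sketch (not run here), reusing the USHSghq data set and psych's fa() from above:

```r
# Orthogonal counterpart of the oblimin solution, for comparison:
# if the varimax loadings are harder to interpret, that further supports
# keeping the oblique rotation, given the substantial factor correlations.
fa4_varimax <- fa(USHSghq, nfactors = 4, rotate = "varimax", fm = "pa")
print(fa4_varimax$loadings, cutoff = 0.3, sort = TRUE)
```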

1.5 Conclusion of factor analysis

  In the four-factor structure I finally adopted, items k31, k32, k33 and k28 comprise the first factor, capturing "social well-being"; items k25, k26, k23 and k30 comprise the second factor, capturing "social engagement"; items k34 and k29 comprise the third factor, capturing "mental well-being"; and items k27 and k24 comprise the fourth factor, capturing "emotional well-being".

1.6 Factor score analysis

1.6.1 Generating factor score variables

#generate factor scores
healthScores  <-  as.data.frame( # factor scores from the FA of general health
  factor.scores(USHSghq, fa4)$scores)  %>% 
  mutate(fsd_id = USHSv2$fsd_id)  %>% 
  rename(Social_wellbeing = PA1,
         Social_engagement = PA2,
         Emotional_wellbeing = PA3,
         Mental_wellbeing = PA4)
#merge the healthScore data set with USHSv2
USHSv2  <-  left_join(USHSv2, healthScores, by = "fsd_id")
#find out the column number
ncol(USHSv2)
## [1] 108
#inspect the  factor score variables
USHSv2 %>% select(1,105:108) %>% head 
##   fsd_id Social_wellbeing Social_engagement Mental_wellbeing
## 1      1       -0.9014528        -0.3126473       -0.3676748
## 2      2       -0.8070226        -0.4297041       -0.1044619
## 3      3        0.4223485         0.3144105        1.9148226
## 4      4       -0.9401290        -0.3139020       -0.3466739
## 5      5        0.2910781        -0.1668135        1.2149583
## 6      6       -0.5946463        -0.3377928       -0.1139275
##   Emotional_wellbeing
## 1          0.36021831
## 2         -0.78260895
## 3          0.91311862
## 4         -0.01544999
## 5          1.34019985
## 6         -0.69372844
USHSv2 %>% head
##   fsd_id k1 k2 bv1 bv3 bv6 k7 k8_1 k8_2 k8_3 k8_4 k9_1 k9_2 k9_3 k9_4 k9_5 k9_6
## 1      1 NA  1   1  23   0  1    4    4    4    5   NA   NA   NA   NA    3    2
## 2      2  2  2   1  23   0  1    4    4    4    4    3    1    1   NA    2    1
## 3      3  3  1   1  23   0  1    3    3    3    3    2    2    3    1    2    2
## 4      4  2  1   1  23   0  1    4    5    5    5    2    1    3    1    2    2
## 5      5  2  1   1  23   0  2    5    4    5    5    3    2    2    1    2    2
## 6      6 NA  2   1  24   0  1    4    3    3    3    2   NA    1   NA    2   NA
##   k9_7 k9_8 k9_9 k9_10 k9_11 k9_12 k9_13 k9_14 k9_15 k9_16 k9_17 k9_18 k9_19
## 1   NA    1   NA    NA     2    NA     3    NA    NA    NA    NA    NA    NA
## 2   NA   NA   NA    NA    NA    NA     3     1     1    NA    NA    NA     1
## 3    2    2    2     3     1     3     2     2     2     2     2     1     3
## 4    2    2    1     2     1     2     1     1     1     1     1     1     2
## 5    2    1    1     1     1     1     2     1     2     2     1     2     3
## 6   NA   NA   NA     1    NA    NA     1    NA     1    NA    NA    NA    NA
##   k9_20 k9_21 k9_22 k9_23 k9_24 k9_25 k9_26 k9_27 k9_28 k9_29 k9_30 k13 k14 k15
## 1    NA     1     1    NA     1    NA    NA    NA    NA    NA    NA   3   1   0
## 2    NA     1     1    NA     3     3     2     1     1    NA    NA   3   1   0
## 3     2     1     1     1     3     2     2     3     2     2     1   4   1   0
## 4     1     2     1     1     1     1     2     1     1     1     1   4   1   0
## 5     1     1     1     1     3     3     2     2     3     1     1   2   2   0
## 6    NA    NA     1    NA     1     1     1    NA    NA    NA     1   3   1   0
##   k16 k17 k18 k19 k20 k21_1 k21_2 k21_3 k21_4 k21_5 k21_6 k21_7 k21_8 k21_9
## 1   0   0   0   0   0    -2    -1     0     1     1     1    -1     0     0
## 2   0   0   0   0   0     0     0     0     0     0     0     0     0     0
## 3   0   0   1   1   0    -1     1     0    NA     1     1    -2    -1    -1
## 4   0   0   0   0   0    -1     1     1     1     1     1     1     0     1
## 5   0   0   0   0   0    -1     2     1     1     1     2     2     1     0
## 6   0   0   0   0   0    -1    -1     0     0     0     0    -2     0     0
##   k22_1 k22_2 k22_3 k22_4 k22_5 k22_6 k22_7 k22_8 k22_9 k22_10 k23 k24 k25 k26
## 1     1     0     1     0     0     0     1     0     1      0   2   2   2   2
## 2     1     0     0     0     0     0     0     0     0      0   2   1   2   2
## 3     2     4     2     3     1     1     1     2     2      2   2   2   3   2
## 4     1     0     1     1     0     0     0     0     0      0   2   1   2   2
## 5     2     1     1     1     0     0     3     1     2      2   3   3   2   2
## 6     1     0     2     0     0     0     1     0     0      0   2   2   2   2
##   k27 k28 k29 k30 k31 k32 k33 k34 k95_1 k95_2 k95_3 k95_4 k95_5 k95_6 k95_7
## 1   3   2   2   2   1   1   1   2     1     1     1     1     1     1     1
## 2   2   1   2   2   2   1   1   2     4     2     2     4     2     2     3
## 3   3   2   4   2   3   2   2   3     2     2     2     2     2     6     4
## 4   3   2   2   2   1   1   1   2     3     2    NA     1     4     1     2
## 5   3   2   2   2   4   2   1   3     4     5     3     4     5     3     3
## 6   1   2   2   2   2   1   1   2     1     3     2     2     4     3     1
##   k95_8 k95_9 k95_10 k95_11 k95_12 k95_13 k95_14 k95_15 k95_16 k95_17 k95_18
## 1     1     1      3      3      3      1      3      1      2      3      1
## 2     2     3      4      4      3      4      5      4      4      4      5
## 3     4     2      2      4      3      2      3      3      3      3      2
## 4     3     1      3      5      2      4      1      2      2      1      2
## 5     4     3      3      4      4      3      4      2      5      4      4
## 6     2     2      2      2      2      2      2      2      2      2      3
##   height weight gender   age      BMI    BMI.factor Social_wellbeing
## 1    180     80   Male  <NA> 24.69136 Normal weight       -0.9014528
## 2    172     55 Female 22-24 18.59113 Normal weight       -0.8070226
## 3    186     94   Male 25-27 27.17077    Overweight        0.4223485
## 4    183     85   Male 22-24 25.38147    Overweight       -0.9401290
## 5    185     66   Male 22-24 19.28415 Normal weight        0.2910781
## 6    164     51 Female  <NA> 18.96193 Normal weight       -0.5946463
##   Social_engagement Mental_wellbeing Emotional_wellbeing
## 1        -0.3126473       -0.3676748          0.36021831
## 2        -0.4297041       -0.1044619         -0.78260895
## 3         0.3144105        1.9148226          0.91311862
## 4        -0.3139020       -0.3466739         -0.01544999
## 5        -0.1668135        1.2149583          1.34019985
## 6        -0.3377928       -0.1139275         -0.69372844

1.6.2 Inspection of factor scores

# inspect the normality of factor scores using histogram
USHSv2 %>% select (fsd_id, 100:108) %>% 
  pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>% 
  ggplot(aes(x = FactorScore, fill = Factor)) +
  geom_histogram() +
  facet_wrap(~Factor) +
  theme_bw()+
  theme(legend.position = "none")

   On face value, the factor scores for emotional well-being and social engagement appeared roughly normally distributed, while those for mental and social well-being did not. See the histograms above.

# inspect the normality of factor scores by gender using histogram
USHSv2 %>% select (fsd_id, 100:108) %>% 
  pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>% 
  ggplot(aes(x = FactorScore, fill = Factor)) +
  geom_histogram() +
  facet_grid(Factor~gender)+
  theme_bw()+
  theme(legend.position = "none")

   On face value, when grouped by gender, the factor-score distributions remained roughly the same as in the ungrouped data. See the histograms above.

# inspect the normality of factor scores by BMI.factor using histogram
USHSv2 %>% select (fsd_id, 100:108) %>% 
  pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>% 
  ggplot(aes(x = FactorScore, fill = Factor)) +
  geom_histogram() +
  facet_grid(BMI.factor~Factor)+
  theme_bw()+
  theme(legend.position = "none")

   On face value, when grouped by BMI range, the factor-score distributions remained roughly the same as in the ungrouped data. See the histograms above.

# inspect the normality of factor scores for social wellbeing using QQplot
hp1 <- USHSv2 %>%  
  ggplot(aes(sample = Social_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue")+
  labs(title = "Social_wellbeing")+
  theme(plot.title = element_text(hjust = 0.5))

# inspect the normality of factor scores for Social engagement using QQplot
hp2 <- USHSv2 %>%  
  ggplot(aes(sample = Social_engagement)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue")+
  labs(title = "Social_engagement")+
  theme(plot.title = element_text(hjust = 0.5))

# inspect the normality of factor scores for Mental wellbeing using QQplot
hp3 <- USHSv2 %>%  
  ggplot(aes(sample = Mental_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue")+
  labs(title = "Mental_wellbeing")+
  theme(plot.title = element_text(hjust = 0.5))

# inspect the normality of factor scores for emotional wellbeing using QQplot
hp4 <- USHSv2 %>%  
  ggplot(aes(sample = Emotional_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue")+
  labs(title = "Emotional_wellbeing")+
  theme(plot.title = element_text(hjust = 0.5))

hp1+hp2+hp3+hp4

   The QQ plots showed that the emotional well-being scores were roughly normally distributed, while the other three dimensions were considerably skewed.

#KS test for factor scores
USHSv2 %>% 
  select(Social_wellbeing:Emotional_wellbeing) %>% 
  apply(., 2, function(x)ks.test(x, "pnorm"))
## $Social_wellbeing
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.11663, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
## 
## $Social_engagement
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.18316, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
## 
## $Mental_wellbeing
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.1696, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
## 
## $Emotional_wellbeing
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.053698, p-value = 6.566e-08
## alternative hypothesis: two-sided

   The KS test for normality was chosen given the considerably large sample size. According to the results, none of the four dimensions was normally distributed (all p < 0.001).

   In summary, the inspection of the histograms and QQ plots was only partly consistent with the KS normality test across the four dimensions. The disagreement concerned whether the emotional well-being score was normally distributed. To reduce the type I error in the subsequent hypothesis testing, I made the conservative decision to treat the scores as non-normal (and therefore adopted non-parametric tests).
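One caveat about `ks.test(x, "pnorm")`: by default it compares against a standard normal N(0, 1). The regression factor scores are approximately standardised, so this is roughly appropriate, but for unstandardised variables the mean and sd must be supplied explicitly. A self-contained sketch with simulated data standing in for a score variable:

```r
set.seed(42)
x <- rnorm(500, mean = 2, sd = 3)     # simulated, deliberately non-standardised

ks.test(x, "pnorm")                   # vs N(0, 1): rejects, wrongly suggesting non-normality
ks.test(x, "pnorm", mean(x), sd(x))   # vs N(mean(x), sd(x)): no rejection expected
```

Strictly, estimating the parameters from the same data makes the standard KS p-values inaccurate (the Lilliefors correction addresses this), so the result should be read as indicative only.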

# Correlation among the factor scores of 4 dimensions. 
hp5 <- USHSv2 %>% 
  ggplot(aes(x = Emotional_wellbeing, y = Social_wellbeing))+
  geom_point(shape = 1, colour = "blue", size =0.5)+
  geom_smooth(method = "lm", colour = "red")  
# continue  
hp6 <- USHSv2 %>% 
  ggplot(aes(x = Emotional_wellbeing, y = Social_engagement))+
  geom_point(shape = 1, colour = "blue", size =0.5)+
  geom_smooth(method = "lm", colour = "red")  
# continue  
hp7 <- USHSv2 %>% 
  ggplot(aes(x = Emotional_wellbeing, y = Mental_wellbeing))+
  geom_point(shape = 1, colour = "blue", size =0.5)+
  geom_smooth(method = "lm", colour = "red")  
# continue  
hp8 <- USHSv2 %>% 
  ggplot(aes(x = Social_wellbeing, y = Mental_wellbeing))+
  geom_point(shape = 1, colour = "blue", size =0.5)+
  geom_smooth(method = "lm", colour = "red")  
# continue  
hp9 <- USHSv2 %>% 
  ggplot(aes(x = Social_wellbeing, y = Social_engagement))+
  geom_point(shape = 1, colour = "blue", size =0.5)+
  geom_smooth(method = "lm", colour = "red")  
# continue  
hp10 <- USHSv2 %>% 
  ggplot(aes(x = Social_engagement, y = Mental_wellbeing))+
  geom_point(shape = 1, colour = "blue", size =0.5)+
  geom_smooth(method = "lm", colour = "red")  
#display the graphs
hp5/hp6|hp7/hp8|hp9/hp10

  The scatter plots above show weak to moderate correlations between each pair of dimensions. This is consistent with the real-world expectation that the different aspects of well-being influence one another. The strongest correlation was between social engagement and mental well-being. This makes intuitive sense: the more one engages in social activity, the more mental well-being one may enjoy. These two could plausibly have some causal relationship, which the other pairs may not. This lends further support to the factor structure.
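The visual impression can be quantified with a correlation matrix. A self-contained sketch with simulated stand-ins for the score columns (in the report one would `select(Social_wellbeing:Emotional_wellbeing)` from USHSv2; `use = "pairwise.complete.obs"` tolerates missing scores):

```r
set.seed(1)
n <- 200
common <- rnorm(n)                 # shared influence linking two scores
scores <- data.frame(
  engagement = common + rnorm(n),  # correlated pair (simulated)
  mental     = common + rnorm(n),
  emotional  = rnorm(n)            # roughly independent score
)
scores$mental[1:5] <- NA           # a few missing values, as in real data

round(cor(scores, use = "pairwise.complete.obs"), 2)
```

For an oblique solution, the estimated factor intercorrelations are also available directly in the fitted psych object (e.g. `fa4$Phi`), which is where the 0.48 to 0.68 range quoted earlier comes from.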

1.6.3 Hypothesis testing

  Previous studies have found that BMI is a good gauge of the risk for diseases associated with excess body fat: the higher the BMI, the higher the risk of conditions such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers. It would therefore be interesting to explore its relation to Finnish students' general health. Hence I tested the following hypotheses:

  a. Students in different BMI ranges have different levels of social well-being.
  b. Students in different BMI ranges have different levels of social engagement.
  c. Students in different BMI ranges have different levels of mental well-being.
  d. Students in different BMI ranges have different levels of emotional well-being.

# Test if student with different BMI ranges have different social well-being scores.
het0 <- USHSv2 %>%  
  summary_factorlist(dependent = "Social_wellbeing", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(het0, align =c("l", "l", "r", "r", "r"))
| label      | levels        | unit         |              value |     p |
|:-----------|:--------------|-------------:|-------------------:|------:|
| BMI ranges | Underweight   | Median (IQR) |  0.1 (-0.8 to 0.4) | 0.043 |
|            | Normal weight | Median (IQR) | -0.3 (-0.9 to 0.5) |       |
|            | Overweight    | Median (IQR) | -0.2 (-0.9 to 0.6) |       |
|            | Obese         | Median (IQR) | -0.1 (-0.8 to 1.0) |       |

  Students in different BMI ranges were found to differ in their sense of social well-being (p = 0.043). Pairwise comparisons were performed to find the source of the difference.

#multiple comparison without correction
  pairwise.t.test(USHSv2$Social_wellbeing, USHSv2$BMI.factor, 
                p.adjust.method = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2$Social_wellbeing and USHSv2$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 0.47034     -             -         
## Overweight    0.91700     0.09627       -         
## Obese         0.21205     0.00057       0.02298   
## 
## P value adjustment method: none
# multiple comparison with corrected p value from bonferroni method
  pairwise.t.test(USHSv2$Social_wellbeing, USHSv2$BMI.factor, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2$Social_wellbeing and USHSv2$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 1.0000      -             -         
## Overweight    1.0000      0.5776        -         
## Obese         1.0000      0.0034        0.1379    
## 
## P value adjustment method: bonferroni

  Multiple comparisons were done to test which BMI range(s) accounted for the difference in the sense of social well-being. Obese students differed from normal weight (p = 0.00057) and overweight students (p = 0.02298) in this regard. However, after correcting the p values with the Bonferroni method, only the difference between normal weight and obese students remained significant (p = 0.0034).

# Test if students in different BMI ranges have different levels of social engagement.
het1 <- USHSv2 %>%  
  summary_factorlist(dependent = "Social_engagement", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1,
                     p = T)
  knitr::kable(het1, align =c("l", "l", "r", "r", "r"))
| label      | levels        | unit         |              value |     p |
|:-----------|:--------------|-------------:|-------------------:|------:|
| BMI ranges | Underweight   | Median (IQR) | -0.3 (-0.4 to 0.6) | 0.022 |
|            | Normal weight | Median (IQR) | -0.3 (-0.4 to 0.1) |       |
|            | Overweight    | Median (IQR) | -0.3 (-0.4 to 0.2) |       |
|            | Obese         | Median (IQR) | -0.2 (-0.3 to 0.5) |       |

  Students in different BMI ranges were found to differ in their sense of social engagement (p = 0.022). Pairwise comparisons were performed to find the source of the difference.

#multiple comparison without correction
  pairwise.t.test(USHSv2$Social_engagement, USHSv2$BMI.factor, 
                p.adjust.method = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2$Social_engagement and USHSv2$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 0.3613      -             -         
## Overweight    0.7692      0.0988        -         
## Obese         0.4919      0.0065        0.1070    
## 
## P value adjustment method: none
# multiple comparison with corrected p value from bonferroni method
  pairwise.t.test(USHSv2$Social_engagement, USHSv2$BMI.factor, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2$Social_engagement and USHSv2$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 1.000       -             -         
## Overweight    1.000       0.593         -         
## Obese         1.000       0.039         0.642     
## 
## P value adjustment method: bonferroni

  Multiple comparisons were done to test which BMI range(s) accounted for the difference in the sense of social engagement. Obese students differed from normal weight students (p = 0.0065) in this regard. After correcting the p values with the Bonferroni method, the difference remained significant (p = 0.039).

# Test if students in different BMI ranges have different sense of mental well-being.
het2 <- USHSv2 %>%  
  summary_factorlist(dependent = "Mental_wellbeing", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(het2, align =c("l", "l", "r", "r", "r"))
| label      | levels        | unit         |              value |     p |
|:-----------|:--------------|-------------:|-------------------:|------:|
| BMI ranges | Underweight   | Median (IQR) | -0.2 (-0.3 to 0.2) | 0.017 |
|            | Normal weight | Median (IQR) | -0.2 (-0.4 to 0.2) |       |
|            | Overweight    | Median (IQR) | -0.2 (-0.4 to 0.4) |       |
|            | Obese         | Median (IQR) | -0.2 (-0.3 to 0.9) |       |

  Students in different BMI ranges were found to differ in their sense of mental well-being (p = 0.017). Pairwise comparisons were performed to find the source of the difference.

#multiple comparison without correction
  pairwise.t.test(USHSv2$Mental_wellbeing, USHSv2$BMI.factor, 
                p.adjust.method = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2$Mental_wellbeing and USHSv2$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 0.528       -             -         
## Overweight    0.943       0.131         -         
## Obese         0.217       0.001         0.028     
## 
## P value adjustment method: none
# multiple comparison with corrected p value from bonferroni method
  pairwise.t.test(USHSv2$Mental_wellbeing, USHSv2$BMI.factor, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2$Mental_wellbeing and USHSv2$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 1.0000      -             -         
## Overweight    1.0000      0.7858        -         
## Obese         1.0000      0.0063        0.1663    
## 
## P value adjustment method: bonferroni

  Multiple comparisons were done to test which BMI range(s) accounted for the difference in the sense of mental well-being. Obese students differed from normal weight (p = 0.001) and overweight students (p = 0.028) in this regard. After correcting the p values with the Bonferroni method, only the difference between obese and normal weight students remained significant (p = 0.0063).

# Test if students in different BMI ranges have different sense of emotional well-being.
het3 <- USHSv2 %>%  
  summary_factorlist(dependent = "Emotional_wellbeing", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(het3, align =c("l", "l", "r", "r", "r"))
| label      | levels        | unit         |              value |     p |
|:-----------|:--------------|-------------:|-------------------:|------:|
| BMI ranges | Underweight   | Median (IQR) | -0.1 (-0.7 to 0.6) | 0.011 |
|            | Normal weight | Median (IQR) | -0.2 (-0.7 to 0.6) |       |
|            | Overweight    | Median (IQR) | -0.1 (-0.6 to 0.7) |       |
|            | Obese         | Median (IQR) |  0.2 (-0.6 to 0.9) |       |

  Students in different BMI ranges were found to differ in their sense of emotional well-being (p = 0.011). Pairwise comparisons were performed to find the source of the difference.

#multiple comparison
  pairwise.t.test(USHSv2$Emotional_wellbeing, USHSv2$BMI.factor, 
                p.adjust.method = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2$Emotional_wellbeing and USHSv2$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 0.5048      -             -         
## Overweight    0.9509      0.0479        -         
## Obese         0.3192      0.0038        0.1107    
## 
## P value adjustment method: none
# multiple comparison with corrected p value from bonferroni method
  pairwise.t.test(USHSv2$Emotional_wellbeing, USHSv2$BMI.factor, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2$Emotional_wellbeing and USHSv2$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 1.000       -             -         
## Overweight    1.000       0.287         -         
## Obese         1.000       0.023         0.664     
## 
## P value adjustment method: bonferroni

  Multiple comparisons were done to test which BMI range(s) accounted for the difference in the sense of emotional well-being. Normal weight students differed from obese (p = 0.0038) and overweight students (p = 0.0479) in this regard. After correcting the p values with the Bonferroni method, only the difference between obese and normal weight students remained significant (p = 0.023).

In summary, obese and normal weight students differ in their sense of emotional well-being, mental well-being, social engagement and social well-being. Note that this is a data-driven analysis and the p values were not corrected for the number of hypotheses.
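The closing caveat can be made concrete: applying a Bonferroni correction across the four omnibus hypotheses (using the p values from the summary tables above) shows that only the emotional well-being difference would survive at the 0.05 level:

```r
# Omnibus p values from the four summary_factorlist tables above
p <- c(social_wellbeing    = 0.043,
       social_engagement   = 0.022,
       mental_wellbeing    = 0.017,
       emotional_wellbeing = 0.011)

# Bonferroni correction across the four hypotheses
p.adjust(p, method = "bonferroni")
# -> 0.172, 0.088, 0.068, 0.044: only emotional well-being stays below 0.05
```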

Analysis 2: Factor analysis of study burnout questionnaire

1. Analysis

1.1 Descriptive statistics for variables

  In this section, descriptive statistics were computed for each variable of USHSstu.

#have a look at the unique values in k95_1~k95_18
USHSstu %>% 
  apply(., 2, function(x)unique(x))%>%
  knitr::kable()
k95_1 k95_2 k95_3 k95_4 k95_5 k95_6 k95_7 k95_8 k95_9 k95_10 k95_11 k95_12 k95_13 k95_14 k95_15 k95_16 k95_17 k95_18
1 1 1 1 1 1 1 1 1 3 3 3 1 3 1 2 3 1
4 2 2 4 2 2 3 2 3 4 4 2 4 5 4 4 4 5
2 5 NA 2 4 6 4 4 2 2 5 4 2 1 3 3 1 2
3 3 3 NA 5 3 2 3 NA NA 2 5 3 4 2 5 2 4
5 4 4 3 NA 4 NA 6 4 1 6 1 5 2 5 6 NA 3
NA NA 5 6 3 NA 6 NA 5 5 NA NA NA 6 NA 1 5 NA
6 6 6 5 6 5 5 5 6 6 1 6 6 NA 6 NA 6 6
#adapting the variable names for better display (zero-padded ordering)
USHSstu <- USHSstu %>% 
  rename(k95_01 = k95_1, k95_02 = k95_2, k95_03 = k95_3,
         k95_04 = k95_4, k95_05 = k95_5, k95_06 = k95_6,
         k95_07 = k95_7, k95_08 = k95_8, k95_09 = k95_9)

#generate two subsets of data containing negative-wording items (01~09) and positive-wording items (10~18), respectively
slongv1 <- USHSstu %>% select (k95_01:k95_09) %>% 
  pivot_longer(everything(), names_to = "item", values_to = "score") 

slongv2 <- USHSstu %>% select (k95_10:k95_18) %>% 
  pivot_longer(everything(), names_to = "item", values_to = "score") 
#visualize the number of values for negative-wording items
slongv1 %>%  
  ggplot (aes(x= factor(score), fill = score))+
  geom_bar() +
  facet_wrap(~item)+
  theme_get()+
  xlab("1=Totally disagree; 2=Disagree;3=Partly disagree; 4=Partly agree; 5=Agree; 6=Totally agree")+ 
  ggtitle("Distribution of choices for negative-wording items under k95")+
  theme(legend.position ="none", axis.title.x = element_text (size = 9, face = "bold"), axis.text.x = element_text (size = 6, vjust = 2), strip.text = element_text(size = 8, face = "bold", vjust = 0.5), axis.ticks = element_blank(), plot.title = element_text (face = "bold")) 

  The findings: "totally disagree" was the most frequent choice, followed by "disagree", indicating that most students were not experiencing study burnout in the dimensions these items cover. In addition, the data were mostly skewed, which suggests care should be taken when adopting maximum likelihood factoring.

#visualize the number of values for positive-wording items
slongv2 %>%  
  ggplot (aes(x= factor(score), fill = score))+
  geom_bar() +
  facet_wrap(~item)+
  theme_get()+
  xlab("1=Totally disagree; 2=Disagree;3=Partly disagree; 4=Partly agree; 5=Agree; 6=Totally agree")+ 
  ggtitle("Distribution of choices for positive-wording items under k95")+
  theme(legend.position ="none", axis.title.x = element_text (size = 9, face = "bold"), axis.text.x = element_text (size = 6, vjust = 2), strip.text = element_text(size = 8, face = "bold", vjust = 0.5), axis.ticks = element_blank(), plot.title = element_text (face = "bold")) 

  The findings: "partly agree" was the most frequent choice, followed by "agree" and "partly disagree", indicating that most students were experiencing little to no study burnout in the dimensions these items cover. The possibility of a modesty bias towards weak agreement when facing positive descriptions should also be considered (this could explain why such a pattern was not present for the negative-wording items 1~9). In addition, the data were mostly normally distributed, suggesting that positive wording might be a good strategy for eliciting a gradient of state changes.

#Examining if missing values are dependent on any demographic variables
slongv3 <- USHSv2 %>%  # convert health-related variables to long format
  select (fsd_id, gender, BMI.factor, k95_1:k95_18) %>% 
  pivot_longer(k95_1:k95_18, names_to = "item", values_to = "score") %>% 
  filter(is.na(score)) 
# plot stacked bar chart of NA frequencies by gender for each item
sp1 <- slongv3 %>%  
  ggplot(aes(x = item, fill = gender))+
  geom_bar(position = position_fill(reverse = T))+
  theme(legend.position = "none", axis.text = element_text(size = 6), 
        plot.title = element_text(vjust = 1), 
        axis.title.y = element_text(size = 9), axis.text.x = element_text(angle =45, vjust = 0.7))+
  labs(title = "Percentage of NAs by gender", 
       y = "Percentage of NAs", x ="", subtitle = "uncorrected")
# plot stacked bar chart of percentage of NA frequencies by gender for each item
sp2 <- slongv3 %>% 
  ggplot(aes(x = item, fill = gender))+
  geom_bar(position = position_stack(reverse = T))+
  theme(legend.position = "none", axis.text = element_text(size = 6), 
        axis.title.y = element_text(size = 9), 
        axis.text.x = element_text(angle =45, vjust = 0.7))+
  labs(title = "Number of NAs by gender", 
       y = "Number of NAs", x = "Item", subtitle = "uncorrected")
#create contingency table for item&gender
sIgender <- slongv3 %>% count(item, gender)
#correct for effect of sample size by dividing male NA number with the sample 
#size of male(n = 1068),and female NA number with that of female(n = 2022)
sIgender_cor <- sIgender %>%  #sIgender means study-burnout item/gender 
                             #sIgender_cor means study-burnout item/gender corrected
  mutate (score = case_when(gender == "Male" ~ n/1068,
                            gender == "Female" ~ n/2022,
                            is.na(gender) ~ n/20)) %>% filter (!is.na(gender))

#plotting the corrected data
sp3 <- sIgender_cor %>% 
  ggplot(aes(x = item, y = score, fill = gender))+
  geom_col(position = position_fill())+
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text (angle = 45))+
  labs(title = "Percent of NAs by gender", 
       subtitle ="corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "Gender"))
sp1/sp2|sp3

   Females gave far more NA answers than males; see the panels on the left. However, after correcting the count in each subgroup (male and female) by its sample size, the distribution of NAs became almost uniform across genders; see the panel on the right.
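The "roughly proportional" impression can also be tested formally with a chi-squared test of independence between gender and missingness. A self-contained sketch with simulated data (in the report, the missingness indicator would be `is.na(score)` on the long-format data, and `gender` the USHSv2 variable):

```r
set.seed(7)
n <- 3000
# simulated gender with a roughly 2:1 female/male split, as in the USHS sample
gender <- sample(c("Female", "Male"), n, replace = TRUE, prob = c(2/3, 1/3))
miss   <- rbinom(n, 1, 0.05) == 1   # 5% missingness, independent of gender

tab <- table(gender, miss)
chisq.test(tab)   # a large p value means no evidence that NAs depend on gender
```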

#same idea with last chunk, with categorical variable switched to BMI ranges
sp4 <- slongv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
  geom_bar(position = position_fill(reverse = T))+
  ylab("Percent of NAs")+
  theme(legend.position = "none", axis.text = element_text(size = 6), 
        axis.title.y = element_text (size = 9), 
        plot.title = element_text (vjust = 1),
        axis.text.x = element_text(angle =45, vjust = 0.7))+
  labs(title = "Percent of NAs by BMI", x= "", subtitle = "uncorrected")
#same idea with last chunk, with categorical variable switched to BMI ranges
sp5 <- slongv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
  geom_bar(position = position_stack(reverse = T))+
  ylab("Number of NAs")+
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text (size = 9),
        axis.text.x = element_text (angle = 45, vjust = 0.7))+
  labs(title = "Number of NAs by BMI", x = "Item", subtitle = "uncorrected")
#same idea with last chunk, with categorical variable switched to BMI ranges
slongv3 %>% count(BMI.factor)
## # A tibble: 5 x 2
##   BMI.factor        n
##   <fct>         <int>
## 1 Underweight      13
## 2 Normal weight   639
## 3 Overweight      230
## 4 Obese           130
## 5 <NA>            121
sIBMI <- slongv3 %>% count(item, BMI.factor) #create contingency table for item & BMI.factor
sIBMI <- sIBMI %>% rename(score = n)
sIBMI_cor <- sIBMI %>%  # correct for the effect of sample size by dividing the NA
                        # count for each BMI level by the number of respondents in
                        # that level (e.g. n = 118 for normal weight); the same
                        # idea applies to the other levels
  mutate(score_corrected = case_when(BMI.factor == "Normal weight" ~ score/118,
                                     BMI.factor == "Overweight"    ~ score/39,
                                     BMI.factor == "Obese"         ~ score/25,
                                     BMI.factor == "Underweight"   ~ score/1,
                                     is.na(BMI.factor)             ~ score/31))
#same idea with last chunk, with categorical variable switched to BMI ranges
sp6 <- sIBMI_cor %>% 
  ggplot(aes(x = item, y = score_corrected, fill = BMI.factor))+
  geom_col(position = position_fill(reverse = T))+
  theme(axis.text = element_text(size = 5), 
        axis.text.x = element_text(angle = 45, vjust= 0.5))+
  labs(title = "Percent of NAs by BMI", 
       subtitle ="corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "BMI"))
sp4/sp5|sp6

   Respondents with normal weight gave the most NA answers (see the panels on the left). After the sample-size correction (see the panel on the right), the NA answers became more evenly distributed across the BMI groups. However, this observation is inconclusive because a large proportion of respondents lacked BMI information. Also note that in the corrected bar graph, underweight respondents seemed to account for the majority of NAs for a couple of items, but I decided to disregard this since the number of underweight respondents was quite small (n = 20), which leads to overestimation after the correction.

#descriptive statistics for study burnout items. 
describe(USHSstu, IQR = T)
##        vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## k95_01    1 3038 3.11 1.45      3    3.06 1.48   1   6     5  0.15    -0.91
## k95_02    2 3051 2.13 1.36      2    1.90 1.48   1   6     5  1.19     0.53
## k95_03    3 3046 2.95 1.56      3    2.85 1.48   1   6     5  0.34    -0.97
## k95_04    4 3052 2.25 1.38      2    2.05 1.48   1   6     5  0.99     0.07
## k95_05    5 3062 2.39 1.55      2    2.18 1.48   1   6     5  0.86    -0.43
## k95_06    6 3057 2.43 1.57      2    2.21 1.48   1   6     5  0.85    -0.45
## k95_07    7 3053 3.27 1.60      3    3.21 1.48   1   6     5  0.09    -1.13
## k95_08    8 3047 2.82 1.68      2    2.67 1.48   1   6     5  0.47    -1.07
## k95_09    9 3047 2.23 1.40      2    2.02 1.48   1   6     5  1.00     0.02
## k95_10   10 3049 3.09 1.15      3    3.08 1.48   1   6     5  0.06    -0.43
## k95_11   11 3046 4.13 1.27      4    4.20 1.48   1   6     5 -0.47    -0.27
## k95_12   12 3047 3.20 1.31      3    3.19 1.48   1   6     5  0.13    -0.58
## k95_13   13 3049 3.22 1.21      3    3.23 1.48   1   6     5  0.00    -0.49
## k95_14   14 3044 3.69 1.29      4    3.72 1.48   1   6     5 -0.20    -0.49
## k95_15   15 3046 2.88 1.24      3    2.82 1.48   1   6     5  0.34    -0.42
## k95_16   16 3047 3.57 1.30      4    3.59 1.48   1   6     5 -0.16    -0.53
## k95_17   17 3048 3.30 1.33      3    3.33 1.48   1   6     5 -0.07    -0.71
## k95_18   18 3018 3.10 1.24      3    3.09 1.48   1   6     5  0.18    -0.46
##          se IQR
## k95_01 0.03   2
## k95_02 0.02   2
## k95_03 0.03   2
## k95_04 0.02   2
## k95_05 0.03   3
## k95_06 0.03   3
## k95_07 0.03   2
## k95_08 0.03   3
## k95_09 0.03   2
## k95_10 0.02   2
## k95_11 0.02   2
## k95_12 0.02   2
## k95_13 0.02   2
## k95_14 0.02   2
## k95_15 0.02   2
## k95_16 0.02   1
## k95_17 0.02   2
## k95_18 0.02   2

  The medians of the positively and negatively worded items clustered with their respective item groups, and the IQRs were consistent within some subgroups of items but varied across subgroups, suggesting different response patterns for different items. The means varied less across items, but since roughly half of the items were clearly non-normal, the mean and standard deviation are less informative here than the median and IQR.
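The point about skew can be seen with a tiny base-R example: a few extreme responses pull the mean but leave the median in place (the values below are made up):

```r
# eight typical answers plus two extreme ones on a 1-6 scale
x <- c(rep(2, 8), 6, 6)
mean(x)    # dragged toward the extremes
median(x)  # unaffected
```

Here the mean is 2.8 while the median stays at 2, which is why median and IQR were emphasised above.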

1.2 Factorability of the items

  Next, the factorability of the items was examined, using several well-recognized criteria for the factorability of a correlation matrix.

#correlation to examine the factorability
sc_matrix <- cor.plot(USHSstu) # sc_matrix means study-burnout correlation matrix

  It was observed that each of the 18 items correlated at least 0.5 with at least one other item, suggesting good factorability.

#KMO test for factorability
KMO(sc_matrix) 
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = sc_matrix)
## Overall MSA =  0.94
## MSA for each item = 
## k95_01 k95_02 k95_03 k95_04 k95_05 k95_06 k95_07 k95_08 k95_09 k95_10 k95_11 
##   0.89   0.93   0.93   0.93   0.93   0.94   0.90   0.95   0.93   0.96   0.95 
## k95_12 k95_13 k95_14 k95_15 k95_16 k95_17 k95_18 
##   0.95   0.95   0.95   0.94   0.96   0.96   0.95

  The Kaiser-Meyer-Olkin measure of sampling adequacy was .94, indicating marvelous adequacy according to Kaiser (1975).
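For reference, Kaiser's (1975) verbal labels can be wrapped in a small helper; the function name and exact cut points below are my own, following the commonly cited thresholds:

```r
# map an MSA value to Kaiser's descriptive labels (hypothetical helper)
kmo_label <- function(msa) {
  as.character(cut(msa,
      breaks = c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1),
      labels = c("unacceptable", "miserable", "mediocre",
                 "middling", "meritorious", "marvelous"),
      include.lowest = TRUE))
}
kmo_label(0.94)  # "marvelous"
```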

#Bartlett test for factorability
cortest.bartlett(sc_matrix, nrow(USHSstu))
## $chisq
## [1] 36358.71
## 
## $p.value
## [1] 0
## 
## $df
## [1] 153

  Bartlett’s test of sphericity was significant (χ2 (153) = 36358.71, p < .05), so the null hypothesis that the correlation matrix equals an identity matrix was rejected, again suggesting good factorability.
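For transparency, the statistic that cortest.bartlett() reports can be reproduced by hand from its textbook formula; a toy two-variable example with made-up numbers:

```r
# chi^2 = -(n - 1 - (2p + 5)/6) * log(det(R)),  df = p(p - 1)/2
R <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)  # toy correlation matrix
n <- 100                                  # toy sample size
p <- ncol(R)
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df    <- p * (p - 1) / 2
round(chisq, 2)  # 43.51
df               # 1
```

A correlation matrix close to identity has det(R) near 1, log(det(R)) near 0, and hence a small chi-square; strong correlations shrink the determinant and inflate the statistic, as in our data.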

1.3 Estimating factor structure

  Next, the number of factors was estimated, using several well-recognized criteria for exploring the factor structure.

# Parallel analysis and scree plot
fa.parallel(USHSstu, fa = "fa", fm = "ml", plot = T) 

## Parallel analysis suggests that the number of factors =  4  and the number of components =  NA
# very simple structure and BIC
nfactors(sc_matrix, n = 5, rotate = "oblimin", fm = "pa",
         n.obs = nrow(USHSstu)) # n.obs must be the sample size, not the number of items

## 
## Number of factors
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
##     n.obs = n.obs, plot = FALSE, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.81  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.87  with  2  factors
## The Velicer MAP achieves a minimum of 0.02  with  3  factors 
## Empirical BIC achieves a minimum of  -285.81  with  2  factors
## Sample Size adjusted BIC achieves a minimum of  39.49  with  5  factors
## 
## Statistics by number of factors 
##   vss1 vss2   map dof chisq prob sqresid  fit RMSEA  BIC SABIC complex eChisq
## 1 0.81 0.00 0.072 135 14.55    1      14 0.81     0 -321    87     1.0  89.92
## 2 0.74 0.87 0.020 118  3.28    1      10 0.87     0 -290    67     1.2   7.41
## 3 0.67 0.79 0.017 102  0.96    1      16 0.79     0 -253    56     1.2   1.59
## 4 0.67 0.75 0.021  87  0.42    1      17 0.78     0 -216    47     1.2   0.94
## 5 0.41 0.50 0.030  73  0.15    1      28 0.63     0 -181    39     1.6   0.57
##    SRMR eCRMS eBIC
## 1 0.156 0.167 -246
## 2 0.045 0.051 -286
## 3 0.021 0.025 -252
## 4 0.016 0.021 -215
## 5 0.012 0.018 -181

  Parallel analysis and the scree plot (four factors above the eigenvalues of the simulated data), based on the correlation matrix, suggested a 4-factor structure, while VSS favoured a 2-factor interpretation, with complexity peaking at the second factor; empirical BIC likewise reached its minimum with 2 factors. The Velicer MAP, in contrast, recommended a 3-factor solution, inconsistent with the others.

1.4 Presetting parameters

  Based on the factor structure exploration, solutions with two, three and four factors were each examined using varimax and oblimin rotations of the factor loading matrix. Both rotations were tried because I could not presume the relationship between the dimensions of study burnout in advance. Histograms showed that more than half of the k95 items were roughly normally distributed, so maximum likelihood factoring was considered first, given its suitability for normal data; other factoring methods could be adopted if ML did not give interpretable results.

1.5 Factor analysis

1.5.1 Examining 2-factor structure

# a) orthogonal rotation
sfa2_var <- fa(USHSstu, nfactors = 2,  #sfa2_var means study factor analysis 2 factors by varimax
           rotate = "varimax", fm = "ml", scores = "regression") 
sfa2_var
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 2, rotate = "varimax", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          ML1   ML2   h2   u2 com
## k95_01  0.05  0.56 0.31 0.69 1.0
## k95_02 -0.46  0.57 0.54 0.46 1.9
## k95_03 -0.17  0.77 0.62 0.38 1.1
## k95_04 -0.05  0.68 0.46 0.54 1.0
## k95_05 -0.51  0.56 0.58 0.42 2.0
## k95_06 -0.35  0.60 0.48 0.52 1.6
## k95_07 -0.06  0.76 0.58 0.42 1.0
## k95_08 -0.32  0.66 0.53 0.47 1.4
## k95_09 -0.04  0.69 0.48 0.52 1.0
## k95_10  0.73 -0.25 0.60 0.40 1.2
## k95_11  0.77 -0.12 0.60 0.40 1.0
## k95_12  0.75 -0.03 0.57 0.43 1.0
## k95_13  0.80 -0.15 0.67 0.33 1.1
## k95_14  0.86 -0.16 0.76 0.24 1.1
## k95_15  0.69 -0.02 0.48 0.52 1.0
## k95_16  0.84 -0.11 0.72 0.28 1.0
## k95_17  0.74 -0.25 0.61 0.39 1.2
## k95_18  0.79 -0.13 0.63 0.37 1.1
## 
##                        ML1  ML2
## SS loadings           6.15 4.06
## Proportion Var        0.34 0.23
## Cumulative Var        0.34 0.57
## Proportion Explained  0.60 0.40
## Cumulative Proportion 0.60 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 118  and the objective function was  1.15 
## 
## The root mean square of the residuals (RMSR) is  0.05 
## The df corrected root mean square of the residuals is  0.05 
## 
## The harmonic number of observations is  3030 with the empirical chi square  1889.87  with prob <  7.1e-317 
## The total number of observations was  3110  with Likelihood Chi Square =  3578.54  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.876
## RMSEA index =  0.097  and the 90 % confidence intervals are  0.094 0.1
## BIC =  2629.53
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy             
##                                                    ML1  ML2
## Correlation of (regression) scores with factors   0.97 0.94
## Multiple R square of scores with factors          0.93 0.88
## Minimum correlation of possible factor scores     0.87 0.76
fa.diagram(sfa2_var, cut = 0, digits =2, 
           main = "Factor Analysis, 2-factor structure, Orthogonal rotation")

#b) oblimin rotation
sfa2_obl <- fa(USHSstu, nfactors = 2,  #sfa2_obl means study factor analysis 2 factors by oblimin
               rotate = "oblimin", fm = "ml", scores = "regression")
sfa2_obl
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 2, rotate = "oblimin", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          ML1   ML2   h2   u2 com
## k95_01  0.16  0.60 0.31 0.69 1.1
## k95_02 -0.37  0.52 0.54 0.46 1.8
## k95_03 -0.02  0.78 0.62 0.38 1.0
## k95_04  0.09  0.71 0.46 0.54 1.0
## k95_05 -0.42  0.50 0.58 0.42 1.9
## k95_06 -0.24  0.56 0.48 0.52 1.4
## k95_07  0.09  0.79 0.58 0.42 1.0
## k95_08 -0.20  0.63 0.53 0.47 1.2
## k95_09  0.10  0.72 0.48 0.52 1.0
## k95_10  0.72 -0.13 0.60 0.40 1.1
## k95_11  0.78  0.02 0.60 0.40 1.0
## k95_12  0.78  0.10 0.57 0.43 1.0
## k95_13  0.81 -0.01 0.67 0.33 1.0
## k95_14  0.87 -0.01 0.76 0.24 1.0
## k95_15  0.72  0.11 0.48 0.52 1.0
## k95_16  0.86  0.04 0.72 0.28 1.0
## k95_17  0.73 -0.13 0.61 0.39 1.1
## k95_18  0.80  0.01 0.63 0.37 1.0
## 
##                        ML1  ML2
## SS loadings           6.18 4.03
## Proportion Var        0.34 0.22
## Cumulative Var        0.34 0.57
## Proportion Explained  0.61 0.39
## Cumulative Proportion 0.61 1.00
## 
##  With factor correlations of 
##       ML1   ML2
## ML1  1.00 -0.35
## ML2 -0.35  1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 118  and the objective function was  1.15 
## 
## The root mean square of the residuals (RMSR) is  0.05 
## The df corrected root mean square of the residuals is  0.05 
## 
## The harmonic number of observations is  3030 with the empirical chi square  1889.87  with prob <  7.1e-317 
## The total number of observations was  3110  with Likelihood Chi Square =  3578.54  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.876
## RMSEA index =  0.097  and the 90 % confidence intervals are  0.094 0.1
## BIC =  2629.53
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy             
##                                                    ML1  ML2
## Correlation of (regression) scores with factors   0.97 0.95
## Multiple R square of scores with factors          0.95 0.89
## Minimum correlation of possible factor scores     0.89 0.79
fa.diagram(sfa2_obl, cut = 0, digits =2, 
           main = "Factor Analysis, 2-factor structure, Oblimin rotation")

  The correlation between the two factors was weak (r = -0.35), suggesting oblimin rotation might be the proper choice in principle. However, examining the items under each factor, the most plausible interpretation of this 2-factor solution is simply the direction of wording: one factor gathered items 1-9, which are negatively worded, while the other gathered the positively worded items 10-18. These should of course correlate negatively on face value, and the solution is not sufficiently convincing because it captures wording rather than meaningful dimensions of study burnout.
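One way to check whether the split really is a wording artifact would be to reverse-score the negatively worded items before factoring, so that all items point in the same direction. A sketch assuming items on a 1-6 scale (the helper name and the choice of items 1-9 as the negatively worded set are my own assumptions):

```r
# reverse a 1-6 Likert item: 1<->6, 2<->5, 3<->4
reverse_likert <- function(x, max_scale = 6) (max_scale + 1) - x

reverse_likert(c(1, 3, 6))  # 6 4 1

# applied before re-running fa(), e.g. (not run):
# USHSstu %>% mutate(across(k95_01:k95_09, reverse_likert))
```

If the two-factor split survives reverse-scoring, it reflects content; if it disappears, it was wording.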

#results by varimax rotation
print(sfa2_var$loadings,cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##        ML1    ML2   
## k95_10  0.732       
## k95_11  0.766       
## k95_12  0.754       
## k95_13  0.801       
## k95_14  0.860       
## k95_15  0.691       
## k95_16  0.840       
## k95_17  0.738       
## k95_18  0.786       
## k95_01         0.558
## k95_02 -0.465  0.573
## k95_03         0.767
## k95_04         0.679
## k95_05 -0.510  0.563
## k95_06 -0.347  0.596
## k95_07         0.758
## k95_08 -0.317  0.655
## k95_09         0.690
## 
##                  ML1   ML2
## SS loadings    6.150 4.058
## Proportion Var 0.342 0.225
## Cumulative Var 0.342 0.567
prop.table(sfa2_var$values) #how much variance explained
##  [1]  0.7313916543  0.2688561406  0.0635584307  0.0199363953  0.0120100620
##  [6]  0.0055508992  0.0005967794 -0.0011079510 -0.0028283920 -0.0032838025
## [11] -0.0055632271 -0.0069200589 -0.0077340405 -0.0110581706 -0.0118488129
## [16] -0.0148476854 -0.0169155858 -0.0197926348
#results by oblimin rotation
print(sfa2_obl$loadings,cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##        ML1    ML2   
## k95_10  0.717       
## k95_11  0.780       
## k95_12  0.785       
## k95_13  0.810       
## k95_14  0.870       
## k95_15  0.721       
## k95_16  0.860       
## k95_17  0.725       
## k95_18  0.799       
## k95_01         0.596
## k95_02 -0.371  0.520
## k95_03         0.778
## k95_04         0.707
## k95_05 -0.421  0.501
## k95_06         0.564
## k95_07         0.789
## k95_08         0.633
## k95_09         0.720
## 
##                  ML1   ML2
## SS loadings    6.043 3.897
## Proportion Var 0.336 0.217
## Cumulative Var 0.336 0.552
prop.table(sfa2_obl$values) #how much variance explained
##  [1]  0.7313916543  0.2688561406  0.0635584307  0.0199363953  0.0120100620
##  [6]  0.0055508992  0.0005967794 -0.0011079510 -0.0028283920 -0.0032838025
## [11] -0.0055632271 -0.0069200589 -0.0077340405 -0.0110581706 -0.0118488129
## [16] -0.0148476854 -0.0169155858 -0.0197926348

  Cross-loading was very pronounced under both rotation methods (especially varimax). This suggests either removing some items or seeking another factor solution, which is consistent with the fact that the two-factor structure did not provide a meaningful interpretation of the dimensions of study burnout (it simply separated positively from negatively worded items).

1.5.2 Examining 3-factor structure

# a) orthogonal rotation
sfa3_var <- fa(USHSstu, nfactors = 3,  #sfa3_var means study factor analysis, 3 factors by varimax
           rotate = "varimax", fm = "ml", scores = "regression") 
sfa3_var
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 3, rotate = "varimax", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          ML1   ML2   ML3   h2   u2 com
## k95_01  0.00  0.63  0.00 0.40 0.60 1.0
## k95_02 -0.34  0.33  0.69 0.70 0.30 1.9
## k95_03 -0.15  0.70  0.31 0.61 0.39 1.5
## k95_04 -0.05  0.67  0.19 0.48 0.52 1.2
## k95_05 -0.37  0.30  0.74 0.77 0.23 1.8
## k95_06 -0.23  0.37  0.63 0.59 0.41 1.9
## k95_07 -0.08  0.80  0.14 0.67 0.33 1.1
## k95_08 -0.26  0.51  0.44 0.53 0.47 2.5
## k95_09 -0.05  0.69  0.17 0.51 0.49 1.1
## k95_10  0.74 -0.23 -0.16 0.63 0.37 1.3
## k95_11  0.70  0.03 -0.37 0.62 0.38 1.5
## k95_12  0.77 -0.03 -0.09 0.60 0.40 1.0
## k95_13  0.82 -0.15 -0.10 0.71 0.29 1.1
## k95_14  0.81 -0.05 -0.32 0.76 0.24 1.3
## k95_15  0.72 -0.05 -0.01 0.52 0.48 1.0
## k95_16  0.80 -0.03 -0.26 0.71 0.29 1.2
## k95_17  0.71 -0.18 -0.24 0.60 0.40 1.4
## k95_18  0.77 -0.07 -0.20 0.63 0.37 1.2
## 
##                        ML1  ML2  ML3
## SS loadings           5.62 3.17 2.25
## Proportion Var        0.31 0.18 0.13
## Cumulative Var        0.31 0.49 0.61
## Proportion Explained  0.51 0.29 0.20
## Cumulative Proportion 0.51 0.80 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 102  and the objective function was  0.44 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  3030 with the empirical chi square  411.02  with prob <  1.1e-38 
## The total number of observations was  3110  with Likelihood Chi Square =  1357.07  with prob <  2.8e-218 
## 
## Tucker Lewis Index of factoring reliability =  0.948
## RMSEA index =  0.063  and the 90 % confidence intervals are  0.06 0.066
## BIC =  536.75
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3
## Correlation of (regression) scores with factors   0.96 0.92 0.88
## Multiple R square of scores with factors          0.92 0.84 0.77
## Minimum correlation of possible factor scores     0.83 0.67 0.53
fa.diagram(sfa3_var, cut = 0, digits =2, 
           main = "Factor Analysis, 3-factor structure, Orthogonal rotation")

#b) oblimin rotation
sfa3_obl <- fa(USHSstu, nfactors = 3,  #sfa3_obl means study factor analysis, 3 factors by oblimin
               rotate = "oblimin", fm = "ml", scores = "regression")
sfa3_obl
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 3, rotate = "oblimin", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          ML1   ML2   ML3   h2   u2 com
## k95_01 -0.01  0.68 -0.12 0.40 0.60 1.1
## k95_02 -0.07  0.08  0.75 0.70 0.30 1.0
## k95_03 -0.04  0.63  0.24 0.61 0.39 1.3
## k95_04  0.02  0.64  0.12 0.48 0.52 1.1
## k95_05 -0.08  0.04  0.81 0.77 0.23 1.0
## k95_06  0.02  0.16  0.69 0.59 0.41 1.1
## k95_07 -0.04  0.81  0.01 0.67 0.33 1.0
## k95_08 -0.09  0.38  0.42 0.53 0.47 2.1
## k95_09  0.01  0.67  0.08 0.51 0.49 1.0
## k95_10  0.76 -0.17  0.02 0.63 0.37 1.1
## k95_11  0.61  0.19 -0.31 0.62 0.38 1.7
## k95_12  0.82  0.03  0.08 0.60 0.40 1.0
## k95_13  0.88 -0.10  0.10 0.71 0.29 1.1
## k95_14  0.76  0.09 -0.21 0.76 0.24 1.2
## k95_15  0.80 -0.02  0.17 0.52 0.48 1.1
## k95_16  0.78  0.09 -0.13 0.71 0.29 1.1
## k95_17  0.69 -0.08 -0.10 0.60 0.40 1.1
## k95_18  0.77  0.02 -0.06 0.63 0.37 1.0
## 
##                        ML1  ML2  ML3
## SS loadings           5.54 2.87 2.63
## Proportion Var        0.31 0.16 0.15
## Cumulative Var        0.31 0.47 0.61
## Proportion Explained  0.50 0.26 0.24
## Cumulative Proportion 0.50 0.76 1.00
## 
##  With factor correlations of 
##       ML1   ML2   ML3
## ML1  1.00 -0.18 -0.55
## ML2 -0.18  1.00  0.46
## ML3 -0.55  0.46  1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 102  and the objective function was  0.44 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  3030 with the empirical chi square  411.02  with prob <  1.1e-38 
## The total number of observations was  3110  with Likelihood Chi Square =  1357.07  with prob <  2.8e-218 
## 
## Tucker Lewis Index of factoring reliability =  0.948
## RMSEA index =  0.063  and the 90 % confidence intervals are  0.06 0.066
## BIC =  536.75
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3
## Correlation of (regression) scores with factors   0.97 0.93 0.94
## Multiple R square of scores with factors          0.94 0.86 0.89
## Minimum correlation of possible factor scores     0.88 0.73 0.77
fa.diagram(sfa3_obl, cut = 0, digits =2, 
           main = "Factor Analysis, 3-factor structure, Oblimin rotation")

  The correlations among the three factors were weak to moderate (0.18–0.55), which does not decisively favour either rotation mathematically, and the two methods gave factoring results with little difference. The interpretation of this 3-factor structure was easier and more enlightening than the 2-factor solution. Qualitatively, I took items 2, 3, 5, 6 and 8 to capture exhaustion (item example: "I feel a lack of motivation in my studies and often think of giving up"); items 1, 3, 4, 7 and 9 to capture a sense of pressure (items 3 and 8 cross-load between these two; example: "I often have feelings of inadequacy in my studies"); and the remaining nine items together to capture a sense of accomplishment (example: "I find my studies full of meaning and purpose").
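A natural next step for the named factors would be checking the internal consistency of each subscale, e.g. with psych::alpha() on the item columns of each factor. The standardized coefficient itself is simple to compute by hand; p and the average inter-item correlation rbar below are made-up illustration values:

```r
# standardized Cronbach's alpha: alpha = p * rbar / (1 + (p - 1) * rbar)
p    <- 5    # number of items in the subscale
rbar <- 0.5  # hypothetical average inter-item correlation
alpha <- p * rbar / (1 + (p - 1) * rbar)
round(alpha, 2)  # 0.83
```

The formula makes explicit that alpha grows both with the number of items and with how strongly they intercorrelate, which ties it back to the loading patterns above.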

#results by varimax rotation
print(sfa3_var, digits = 2, sort = T)
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 3, rotate = "varimax", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        item   ML1   ML2   ML3   h2   u2 com
## k95_13   13  0.82 -0.15 -0.10 0.71 0.29 1.1
## k95_14   14  0.81 -0.05 -0.32 0.76 0.24 1.3
## k95_16   16  0.80 -0.03 -0.26 0.71 0.29 1.2
## k95_18   18  0.77 -0.07 -0.20 0.63 0.37 1.2
## k95_12   12  0.77 -0.03 -0.09 0.60 0.40 1.0
## k95_10   10  0.74 -0.23 -0.16 0.63 0.37 1.3
## k95_15   15  0.72 -0.05 -0.01 0.52 0.48 1.0
## k95_17   17  0.71 -0.18 -0.24 0.60 0.40 1.4
## k95_11   11  0.70  0.03 -0.37 0.62 0.38 1.5
## k95_07    7 -0.08  0.80  0.14 0.67 0.33 1.1
## k95_03    3 -0.15  0.70  0.31 0.61 0.39 1.5
## k95_09    9 -0.05  0.69  0.17 0.51 0.49 1.1
## k95_04    4 -0.05  0.67  0.19 0.48 0.52 1.2
## k95_01    1  0.00  0.63  0.00 0.40 0.60 1.0
## k95_08    8 -0.26  0.51  0.44 0.53 0.47 2.5
## k95_05    5 -0.37  0.30  0.74 0.77 0.23 1.8
## k95_02    2 -0.34  0.33  0.69 0.70 0.30 1.9
## k95_06    6 -0.23  0.37  0.63 0.59 0.41 1.9
## 
##                        ML1  ML2  ML3
## SS loadings           5.62 3.17 2.25
## Proportion Var        0.31 0.18 0.13
## Cumulative Var        0.31 0.49 0.61
## Proportion Explained  0.51 0.29 0.20
## Cumulative Proportion 0.51 0.80 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 102  and the objective function was  0.44 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  3030 with the empirical chi square  411.02  with prob <  1.1e-38 
## The total number of observations was  3110  with Likelihood Chi Square =  1357.07  with prob <  2.8e-218 
## 
## Tucker Lewis Index of factoring reliability =  0.948
## RMSEA index =  0.063  and the 90 % confidence intervals are  0.06 0.066
## BIC =  536.75
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3
## Correlation of (regression) scores with factors   0.96 0.92 0.88
## Multiple R square of scores with factors          0.92 0.84 0.77
## Minimum correlation of possible factor scores     0.83 0.67 0.53
print(sfa3_var$loadings,cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##        ML1    ML2    ML3   
## k95_10  0.742              
## k95_11  0.696        -0.367
## k95_12  0.767              
## k95_13  0.825              
## k95_14  0.809        -0.323
## k95_15  0.719              
## k95_16  0.802              
## k95_17  0.715              
## k95_18  0.767              
## k95_01         0.634       
## k95_03         0.698  0.313
## k95_04         0.666       
## k95_07         0.802       
## k95_08         0.512  0.445
## k95_09         0.689       
## k95_02 -0.336  0.326  0.690
## k95_05 -0.370  0.301  0.737
## k95_06         0.370  0.631
## 
##                  ML1   ML2   ML3
## SS loadings    5.623 3.171 2.250
## Proportion Var 0.312 0.176 0.125
## Cumulative Var 0.312 0.489 0.614
summary(sfa3_var)
## 
## Factor analysis with Call: fa(r = USHSstu, nfactors = 3, rotate = "varimax", scores = "regression", 
##     fm = "ml")
## 
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 102  and the objective function was  0.44 
## The number of observations was  3110  with Chi Square =  1357.07  with prob <  2.8e-218 
## 
## The root mean square of the residuals (RMSA) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## Tucker Lewis Index of factoring reliability =  0.948
## RMSEA index =  0.063  and the 10 % confidence intervals are  0.06 0.066
## BIC =  536.75
prop.table(sfa3_var$values) #how much variance explained
##  [1]  0.6800085511  0.2523163682  0.0681477434  0.0197943731  0.0122145099
##  [6]  0.0086038849  0.0068992316  0.0035587645  0.0004914545 -0.0002512056
## [11] -0.0028692960 -0.0032646484 -0.0041252421 -0.0053082869 -0.0059471884
## [16] -0.0081640814 -0.0101700422 -0.0119348901
#results by oblimin rotation
print(sfa3_obl, digits = 2, sort = T)
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 3, rotate = "oblimin", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        item   ML1   ML2   ML3   h2   u2 com
## k95_13   13  0.88 -0.10  0.10 0.71 0.29 1.1
## k95_12   12  0.82  0.03  0.08 0.60 0.40 1.0
## k95_15   15  0.80 -0.02  0.17 0.52 0.48 1.1
## k95_16   16  0.78  0.09 -0.13 0.71 0.29 1.1
## k95_18   18  0.77  0.02 -0.06 0.63 0.37 1.0
## k95_14   14  0.76  0.09 -0.21 0.76 0.24 1.2
## k95_10   10  0.76 -0.17  0.02 0.63 0.37 1.1
## k95_17   17  0.69 -0.08 -0.10 0.60 0.40 1.1
## k95_11   11  0.61  0.19 -0.31 0.62 0.38 1.7
## k95_07    7 -0.04  0.81  0.01 0.67 0.33 1.0
## k95_01    1 -0.01  0.68 -0.12 0.40 0.60 1.1
## k95_09    9  0.01  0.67  0.08 0.51 0.49 1.0
## k95_04    4  0.02  0.64  0.12 0.48 0.52 1.1
## k95_03    3 -0.04  0.63  0.24 0.61 0.39 1.3
## k95_05    5 -0.08  0.04  0.81 0.77 0.23 1.0
## k95_02    2 -0.07  0.08  0.75 0.70 0.30 1.0
## k95_06    6  0.02  0.16  0.69 0.59 0.41 1.1
## k95_08    8 -0.09  0.38  0.42 0.53 0.47 2.1
## 
##                        ML1  ML2  ML3
## SS loadings           5.54 2.87 2.63
## Proportion Var        0.31 0.16 0.15
## Cumulative Var        0.31 0.47 0.61
## Proportion Explained  0.50 0.26 0.24
## Cumulative Proportion 0.50 0.76 1.00
## 
##  With factor correlations of 
##       ML1   ML2   ML3
## ML1  1.00 -0.18 -0.55
## ML2 -0.18  1.00  0.46
## ML3 -0.55  0.46  1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 102  and the objective function was  0.44 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  3030 with the empirical chi square  411.02  with prob <  1.1e-38 
## The total number of observations was  3110  with Likelihood Chi Square =  1357.07  with prob <  2.8e-218 
## 
## Tucker Lewis Index of factoring reliability =  0.948
## RMSEA index =  0.063  and the 90 % confidence intervals are  0.06 0.066
## BIC =  536.75
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3
## Correlation of (regression) scores with factors   0.97 0.93 0.94
## Multiple R square of scores with factors          0.94 0.86 0.89
## Minimum correlation of possible factor scores     0.88 0.73 0.77
print(sfa3_obl$loadings,cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##        ML1    ML2    ML3   
## k95_10  0.759              
## k95_11  0.613        -0.310
## k95_12  0.818              
## k95_13  0.877              
## k95_14  0.759              
## k95_15  0.798              
## k95_16  0.782              
## k95_17  0.691              
## k95_18  0.766              
## k95_01         0.679       
## k95_03         0.627       
## k95_04         0.640       
## k95_07         0.805       
## k95_09         0.673       
## k95_02                0.753
## k95_05                0.811
## k95_06                0.694
## k95_08         0.376  0.419
## 
##                  ML1   ML2   ML3
## SS loadings    5.301 2.636 2.189
## Proportion Var 0.295 0.146 0.122
## Cumulative Var 0.295 0.441 0.563
summary(sfa3_obl)
## 
## Factor analysis with Call: fa(r = USHSstu, nfactors = 3, rotate = "oblimin", scores = "regression", 
##     fm = "ml")
## 
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 102  and the objective function was  0.44 
## The number of observations was  3110  with Chi Square =  1357.07  with prob <  2.8e-218 
## 
## The root mean square of the residuals (RMSA) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## Tucker Lewis Index of factoring reliability =  0.948
## RMSEA index =  0.063  and the 10 % confidence intervals are  0.06 0.066
## BIC =  536.75
##  With factor correlations of 
##       ML1   ML2   ML3
## ML1  1.00 -0.18 -0.55
## ML2 -0.18  1.00  0.46
## ML3 -0.55  0.46  1.00
prop.table(sfa3_obl$values) #how much variance explained
##  [1]  0.6800085511  0.2523163682  0.0681477434  0.0197943731  0.0122145099
##  [6]  0.0086038849  0.0068992316  0.0035587645  0.0004914545 -0.0002512056
## [11] -0.0028692960 -0.0032646484 -0.0041252421 -0.0053082869 -0.0059471884
## [16] -0.0081640814 -0.0101700422 -0.0119348901

  Cross-loading was considerably heavy under orthogonal rotation (seven cross-loading items, two of them loading on all three factors simultaneously), but under oblimin rotation it was reduced to an acceptable level (two cross-loading items, with loadings of 0.31 vs 0.61 and 0.37 vs 0.41). This indicated the oblimin results should be preferred.
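Since the 3-factor oblimin solution is preferred, its regression scores (sfa3_obl$scores) could be attached to the data and given the descriptive names from the interpretation above. The sketch below shows the renaming pattern on a fake score matrix so that it stands alone; the factor names are my interpretation, not output of fa():

```r
library(dplyr)

# fake stand-in for sfa3_obl$scores (3 respondents, 3 factors)
scores <- matrix(rnorm(9), ncol = 3,
                 dimnames = list(NULL, c("ML1", "ML2", "ML3")))

scored <- as_tibble(scores) %>%
  rename(accomplishment = ML1, pressure = ML2, exhaustion = ML3)

names(scored)  # "accomplishment" "pressure" "exhaustion"
```

With the real scores, bind_cols(USHSstu, as_tibble(sfa3_obl$scores)) followed by the same rename() would give score variables ready for the histograms, box plots and scatter plots requested in the assignment.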

1.5.3 Examining 4-factor structure

# a) orthogonal rotation
sfa4_var <- fa(USHSstu, nfactors = 4,  #sfa4_var means study factor analysis, 4 factors by varimax
           rotate = "varimax", fm = "ml", scores = "regression") 
sfa4_var
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 4, rotate = "varimax", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          ML1   ML2   ML3   ML4   h2   u2 com
## k95_01  0.00  0.64  0.01  0.01 0.40 0.60 1.0
## k95_02 -0.33  0.30  0.71  0.00 0.70 0.30 1.8
## k95_03 -0.15  0.68  0.35  0.13 0.62 0.38 1.7
## k95_04 -0.05  0.68  0.19 -0.09 0.51 0.49 1.2
## k95_05 -0.37  0.28  0.74 -0.05 0.77 0.23 1.8
## k95_06 -0.23  0.35  0.64 -0.02 0.59 0.41 1.8
## k95_07 -0.08  0.80  0.16  0.03 0.67 0.33 1.1
## k95_08 -0.26  0.49  0.47  0.08 0.54 0.46 2.6
## k95_09 -0.05  0.70  0.18 -0.04 0.52 0.48 1.2
## k95_10  0.74 -0.22 -0.19 -0.12 0.65 0.35 1.4
## k95_11  0.69  0.02 -0.35  0.15 0.63 0.37 1.6
## k95_12  0.77  0.00 -0.12 -0.20 0.65 0.35 1.2
## k95_13  0.83 -0.13 -0.13 -0.17 0.76 0.24 1.2
## k95_14  0.81 -0.06 -0.31  0.17 0.78 0.22 1.4
## k95_15  0.72 -0.04 -0.03 -0.09 0.52 0.48 1.0
## k95_16  0.81 -0.05 -0.23  0.24 0.77 0.23 1.4
## k95_17  0.71 -0.20 -0.23  0.17 0.63 0.37 1.5
## k95_18  0.76 -0.07 -0.21  0.04 0.63 0.37 1.2
## 
##                        ML1  ML2  ML3  ML4
## SS loadings           5.62 3.10 2.35 0.27
## Proportion Var        0.31 0.17 0.13 0.01
## Cumulative Var        0.31 0.48 0.61 0.63
## Proportion Explained  0.50 0.27 0.21 0.02
## Cumulative Proportion 0.50 0.77 0.98 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 87  and the objective function was  0.28 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  3030 with the empirical chi square  248.15  with prob <  2.1e-17 
## The total number of observations was  3110  with Likelihood Chi Square =  853.75  with prob <  3e-126 
## 
## Tucker Lewis Index of factoring reliability =  0.963
## RMSEA index =  0.053  and the 90 % confidence intervals are  0.05 0.057
## BIC =  154.06
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3   ML4
## Correlation of (regression) scores with factors   0.96 0.91 0.88  0.67
## Multiple R square of scores with factors          0.92 0.83 0.77  0.45
## Minimum correlation of possible factor scores     0.85 0.66 0.54 -0.09
fa.diagram(sfa4_var, cut = 0, digits =2, 
           main = "Factor Analysis, 4-factor structure, Orthogonal rotation")

#b) oblimin rotation
sfa4_obl <- fa(USHSstu, nfactors = 4,  #sfa4_obl = study factor analysis, 4 factors, oblimin rotation
               rotate = "oblimin", fm = "ml", scores = "regression")
sfa4_obl
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 4, rotate = "oblimin", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          ML1   ML3   ML2   ML4   h2   u2 com
## k95_01 -0.02 -0.13  0.69  0.00 0.40 0.60 1.1
## k95_02 -0.04  0.82  0.01  0.02 0.70 0.30 1.0
## k95_03 -0.02  0.32  0.55  0.15 0.62 0.38 1.8
## k95_04  0.02  0.07  0.68 -0.09 0.51 0.49 1.1
## k95_05 -0.06  0.85 -0.02 -0.03 0.77 0.23 1.0
## k95_06  0.05  0.73  0.10  0.00 0.59 0.41 1.0
## k95_07 -0.04  0.02  0.80  0.03 0.67 0.33 1.0
## k95_08 -0.07  0.49  0.30  0.09 0.54 0.46 1.8
## k95_09  0.01  0.06  0.69 -0.04 0.52 0.48 1.0
## k95_10  0.75 -0.04 -0.13 -0.11 0.65 0.35 1.1
## k95_11  0.60 -0.27  0.15  0.16 0.63 0.37 1.7
## k95_12  0.82 -0.01  0.10 -0.19 0.65 0.35 1.1
## k95_13  0.88  0.03 -0.05 -0.16 0.76 0.24 1.1
## k95_14  0.75 -0.15  0.03  0.18 0.78 0.22 1.2
## k95_15  0.79  0.12  0.00 -0.07 0.52 0.48 1.1
## k95_16  0.78 -0.02 -0.01  0.27 0.77 0.23 1.2
## k95_17  0.69 -0.03 -0.16  0.19 0.63 0.37 1.3
## k95_18  0.75 -0.06  0.01  0.05 0.63 0.37 1.0
## 
##                        ML1  ML3  ML2  ML4
## SS loadings           5.45 2.81 2.75 0.32
## Proportion Var        0.30 0.16 0.15 0.02
## Cumulative Var        0.30 0.46 0.61 0.63
## Proportion Explained  0.48 0.25 0.24 0.03
## Cumulative Proportion 0.48 0.73 0.97 1.00
## 
##  With factor correlations of 
##       ML1   ML3   ML2   ML4
## ML1  1.00 -0.55 -0.17  0.13
## ML3 -0.55  1.00  0.52 -0.19
## ML2 -0.17  0.52  1.00  0.09
## ML4  0.13 -0.19  0.09  1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 87  and the objective function was  0.28 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  3030 with the empirical chi square  248.15  with prob <  2.1e-17 
## The total number of observations was  3110  with Likelihood Chi Square =  853.75  with prob <  3e-126 
## 
## Tucker Lewis Index of factoring reliability =  0.963
## RMSEA index =  0.053  and the 90 % confidence intervals are  0.05 0.057
## BIC =  154.06
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1  ML3  ML2  ML4
## Correlation of (regression) scores with factors   0.97 0.95 0.93 0.71
## Multiple R square of scores with factors          0.94 0.90 0.86 0.51
## Minimum correlation of possible factor scores     0.89 0.80 0.73 0.01
fa.diagram(sfa4_obl, cut = 0, digits =2, 
           main = "Factor Analysis, 4-factor structure, Oblimin rotation")

  The correlations among the four factors were weak to moderate (0.12 to 0.55), so the data did not strongly favour either rotation method on mathematical grounds, and the two rotations produced very similar results. With both methods, however, the variables still loaded on only three factors instead of the four we specified, and the grouping of items was the same as in the 3-factor solution.
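To make the rotation comparison concrete, the factor correlations can be pulled straight from the oblique fit (a quick sketch using the `psych` objects above; varimax solutions carry no `Phi` matrix because the factors are forced to be orthogonal):

```r
# Factor correlation matrix (Phi) from the oblimin solution
round(sfa4_obl$Phi, 2)
# A common rule of thumb: correlations clearly above ~0.3 (here e.g.
# ML1-ML3 = -0.55) suggest an oblique rotation describes the data better.
```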

#results by varimax rotation
print(sfa4_var, digits = 2, sort = T)
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 4, rotate = "varimax", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        item   ML1   ML2   ML3   ML4   h2   u2 com
## k95_13   13  0.83 -0.13 -0.13 -0.17 0.76 0.24 1.2
## k95_14   14  0.81 -0.06 -0.31  0.17 0.78 0.22 1.4
## k95_16   16  0.81 -0.05 -0.23  0.24 0.77 0.23 1.4
## k95_12   12  0.77  0.00 -0.12 -0.20 0.65 0.35 1.2
## k95_18   18  0.76 -0.07 -0.21  0.04 0.63 0.37 1.2
## k95_10   10  0.74 -0.22 -0.19 -0.12 0.65 0.35 1.4
## k95_15   15  0.72 -0.04 -0.03 -0.09 0.52 0.48 1.0
## k95_17   17  0.71 -0.20 -0.23  0.17 0.63 0.37 1.5
## k95_11   11  0.69  0.02 -0.35  0.15 0.63 0.37 1.6
## k95_07    7 -0.08  0.80  0.16  0.03 0.67 0.33 1.1
## k95_09    9 -0.05  0.70  0.18 -0.04 0.52 0.48 1.2
## k95_04    4 -0.05  0.68  0.19 -0.09 0.51 0.49 1.2
## k95_03    3 -0.15  0.68  0.35  0.13 0.62 0.38 1.7
## k95_01    1  0.00  0.64  0.01  0.01 0.40 0.60 1.0
## k95_08    8 -0.26  0.49  0.47  0.08 0.54 0.46 2.6
## k95_05    5 -0.37  0.28  0.74 -0.05 0.77 0.23 1.8
## k95_02    2 -0.33  0.30  0.71  0.00 0.70 0.30 1.8
## k95_06    6 -0.23  0.35  0.64 -0.02 0.59 0.41 1.8
## 
##                        ML1  ML2  ML3  ML4
## SS loadings           5.62 3.10 2.35 0.27
## Proportion Var        0.31 0.17 0.13 0.01
## Cumulative Var        0.31 0.48 0.61 0.63
## Proportion Explained  0.50 0.27 0.21 0.02
## Cumulative Proportion 0.50 0.77 0.98 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 87  and the objective function was  0.28 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  3030 with the empirical chi square  248.15  with prob <  2.1e-17 
## The total number of observations was  3110  with Likelihood Chi Square =  853.75  with prob <  3e-126 
## 
## Tucker Lewis Index of factoring reliability =  0.963
## RMSEA index =  0.053  and the 90 % confidence intervals are  0.05 0.057
## BIC =  154.06
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3   ML4
## Correlation of (regression) scores with factors   0.96 0.91 0.88  0.67
## Multiple R square of scores with factors          0.92 0.83 0.77  0.45
## Minimum correlation of possible factor scores     0.85 0.66 0.54 -0.09
print(sfa4_var$loadings,cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##        ML1    ML2    ML3    ML4   
## k95_10  0.741                     
## k95_11  0.694        -0.354       
## k95_12  0.773                     
## k95_13  0.833                     
## k95_14  0.809        -0.309       
## k95_15  0.717                     
## k95_16  0.807                     
## k95_17  0.713                     
## k95_18  0.761                     
## k95_01         0.636              
## k95_03         0.677  0.352       
## k95_04         0.679              
## k95_07         0.797              
## k95_09         0.695              
## k95_02 -0.331  0.303  0.710       
## k95_05 -0.366         0.744       
## k95_06         0.352  0.641       
## k95_08         0.491  0.473       
## 
##                  ML1   ML2   ML3   ML4
## SS loadings    5.624 3.096 2.346 0.265
## Proportion Var 0.312 0.172 0.130 0.015
## Cumulative Var 0.312 0.484 0.615 0.630
summary(sfa4_var)
## 
## Factor analysis with Call: fa(r = USHSstu, nfactors = 4, rotate = "varimax", scores = "regression", 
##     fm = "ml")
## 
## Test of the hypothesis that 4 factors are sufficient.
## The degrees of freedom for the model is 87  and the objective function was  0.28 
## The number of observations was  3110  with Chi Square =  853.75  with prob <  3e-126 
## 
## The root mean square of the residuals (RMSA) is  0.02 
## The df corrected root mean square of the residuals is  0.02 
## 
## Tucker Lewis Index of factoring reliability =  0.963
## RMSEA index =  0.053  and the 10 % confidence intervals are  0.05 0.057
## BIC =  154.06
prop.table(sfa4_var$values) #how much variance explained
##  [1]  0.6643582253  0.2472007411  0.0671021500  0.0219126960  0.0130544435
##  [6]  0.0095806972  0.0073373726  0.0045409088  0.0019940533  0.0009718059
## [11] -0.0005320870 -0.0014614124 -0.0020711661 -0.0032972384 -0.0044956804
## [16] -0.0064040112 -0.0089287114 -0.0108627869
#results by oblimin rotation
print(sfa4_obl, digits = 2, sort = T)
## Factor Analysis using method =  ml
## Call: fa(r = USHSstu, nfactors = 4, rotate = "oblimin", scores = "regression", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        item   ML1   ML3   ML2   ML4   h2   u2 com
## k95_13   13  0.88  0.03 -0.05 -0.16 0.76 0.24 1.1
## k95_12   12  0.82 -0.01  0.10 -0.19 0.65 0.35 1.1
## k95_15   15  0.79  0.12  0.00 -0.07 0.52 0.48 1.1
## k95_16   16  0.78 -0.02 -0.01  0.27 0.77 0.23 1.2
## k95_14   14  0.75 -0.15  0.03  0.18 0.78 0.22 1.2
## k95_18   18  0.75 -0.06  0.01  0.05 0.63 0.37 1.0
## k95_10   10  0.75 -0.04 -0.13 -0.11 0.65 0.35 1.1
## k95_17   17  0.69 -0.03 -0.16  0.19 0.63 0.37 1.3
## k95_11   11  0.60 -0.27  0.15  0.16 0.63 0.37 1.7
## k95_05    5 -0.06  0.85 -0.02 -0.03 0.77 0.23 1.0
## k95_02    2 -0.04  0.82  0.01  0.02 0.70 0.30 1.0
## k95_06    6  0.05  0.73  0.10  0.00 0.59 0.41 1.0
## k95_08    8 -0.07  0.49  0.30  0.09 0.54 0.46 1.8
## k95_07    7 -0.04  0.02  0.80  0.03 0.67 0.33 1.0
## k95_09    9  0.01  0.06  0.69 -0.04 0.52 0.48 1.0
## k95_01    1 -0.02 -0.13  0.69  0.00 0.40 0.60 1.1
## k95_04    4  0.02  0.07  0.68 -0.09 0.51 0.49 1.1
## k95_03    3 -0.02  0.32  0.55  0.15 0.62 0.38 1.8
## 
##                        ML1  ML3  ML2  ML4
## SS loadings           5.45 2.81 2.75 0.32
## Proportion Var        0.30 0.16 0.15 0.02
## Cumulative Var        0.30 0.46 0.61 0.63
## Proportion Explained  0.48 0.25 0.24 0.03
## Cumulative Proportion 0.48 0.73 0.97 1.00
## 
##  With factor correlations of 
##       ML1   ML3   ML2   ML4
## ML1  1.00 -0.55 -0.17  0.13
## ML3 -0.55  1.00  0.52 -0.19
## ML2 -0.17  0.52  1.00  0.09
## ML4  0.13 -0.19  0.09  1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  153  and the objective function was  11.72 with Chi Square of  36358.71
## The degrees of freedom for the model are 87  and the objective function was  0.28 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  3030 with the empirical chi square  248.15  with prob <  2.1e-17 
## The total number of observations was  3110  with Likelihood Chi Square =  853.75  with prob <  3e-126 
## 
## Tucker Lewis Index of factoring reliability =  0.963
## RMSEA index =  0.053  and the 90 % confidence intervals are  0.05 0.057
## BIC =  154.06
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1  ML3  ML2  ML4
## Correlation of (regression) scores with factors   0.97 0.95 0.93 0.71
## Multiple R square of scores with factors          0.94 0.90 0.86 0.51
## Minimum correlation of possible factor scores     0.89 0.80 0.73 0.01
print(sfa4_obl$loadings,cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##        ML1    ML3    ML2    ML4   
## k95_10  0.749                     
## k95_11  0.598                     
## k95_12  0.816                     
## k95_13  0.880                     
## k95_14  0.750                     
## k95_15  0.790                     
## k95_16  0.784                     
## k95_17  0.686                     
## k95_18  0.750                     
## k95_02         0.819              
## k95_05         0.848              
## k95_06         0.733              
## k95_01                0.692       
## k95_03         0.321  0.551       
## k95_04                0.681       
## k95_07                0.798       
## k95_09                0.694       
## k95_08         0.490  0.304       
## 
##                  ML1   ML3   ML2   ML4
## SS loadings    5.209 2.414 2.545 0.288
## Proportion Var 0.289 0.134 0.141 0.016
## Cumulative Var 0.289 0.423 0.565 0.581
summary(sfa4_obl)
## 
## Factor analysis with Call: fa(r = USHSstu, nfactors = 4, rotate = "oblimin", scores = "regression", 
##     fm = "ml")
## 
## Test of the hypothesis that 4 factors are sufficient.
## The degrees of freedom for the model is 87  and the objective function was  0.28 
## The number of observations was  3110  with Chi Square =  853.75  with prob <  3e-126 
## 
## The root mean square of the residuals (RMSA) is  0.02 
## The df corrected root mean square of the residuals is  0.02 
## 
## Tucker Lewis Index of factoring reliability =  0.963
## RMSEA index =  0.053  and the 10 % confidence intervals are  0.05 0.057
## BIC =  154.06
##  With factor correlations of 
##       ML1   ML3   ML2   ML4
## ML1  1.00 -0.55 -0.17  0.13
## ML3 -0.55  1.00  0.52 -0.19
## ML2 -0.17  0.52  1.00  0.09
## ML4  0.13 -0.19  0.09  1.00
prop.table(sfa4_obl$values) #how much variance explained
##  [1]  0.6643582253  0.2472007411  0.0671021500  0.0219126960  0.0130544435
##  [6]  0.0095806972  0.0073373726  0.0045409088  0.0019940533  0.0009718059
## [11] -0.0005320870 -0.0014614124 -0.0020711661 -0.0032972384 -0.0044956804
## [16] -0.0064040112 -0.0089287114 -0.0108627869

  Cross-loadings were even more pronounced in this 4-factor solution than in the 3-factor one. Because it reproduced the same item grouping as the 3-factor solution, I did not examine it further.

  Solutions with two, three and four factors were each examined using varimax and oblimin rotations of the factor loading matrix. Correlation coefficients among the factors ranged from 0.12 to 0.55 in all solutions, not strongly favouring either rotation method. The three-factor solution with oblimin rotation, which explained 68.0% of the variance, was preferred because (a) the 2-factor solution was hardly interpretable from the perspective of study burnout; (b) although the eigenvalues levelled off on the scree plot after four factors, the items in the 4-factor analysis still loaded on only three factors; and (c) the 2-factor and 4-factor structures were plagued by cross-loadings.
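The scree-plot argument in (b) can be backed up with a parallel analysis, which compares the observed eigenvalues to those expected from random data of the same dimensions (a sketch; `fa.parallel` is from the `psych` package used throughout):

```r
library(psych)
# Scree plot plus parallel analysis: suggests retaining as many factors
# as have eigenvalues above those simulated from random data.
fa.parallel(USHSstu, fm = "ml", fa = "fa")
```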

1.5 conclusion of factor analysis

  In the 3-factor structure with oblimin rotation that I finally adopted, items 5, 2, 6 and 8 captured the dimension of exhaustion (item example: "I feel a lack of motivation in my studies and often think of giving up"); items 7, 1, 9, 4 and 3 captured a sense of pressure (example: "I often have feelings of inadequacy in my studies"); and the remaining nine items together captured a sense of accomplishment (example: "I find my studies full of meaning and purpose").
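A common sanity check on the named dimensions is their internal consistency (Cronbach's alpha); a sketch, with the item groupings taken from the oblimin loadings above:

```r
library(psych)
exhaustion_items     <- USHSstu[, c("k95_02", "k95_05", "k95_06", "k95_08")]
pressure_items       <- USHSstu[, c("k95_01", "k95_03", "k95_04", "k95_07", "k95_09")]
accomplishment_items <- USHSstu[, paste0("k95_", 10:18)]
# Cronbach's alpha per scale; values around 0.7 and above are usually
# considered acceptable (check.keys handles reverse-scored items, if any)
alpha(exhaustion_items, check.keys = TRUE)$total$raw_alpha
alpha(pressure_items, check.keys = TRUE)$total$raw_alpha
alpha(accomplishment_items, check.keys = TRUE)$total$raw_alpha
```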

1.6 factor score analysis

1.6.1 generating factor score variables

#generating factor scores
stuScores  <-  as.data.frame( #stuScores mean score from factor analysis of study burnout
  factor.scores(USHSstu, sfa3_obl)$scores)  %>% 
  mutate(fsd_id = USHSv2$fsd_id)  %>% 
  rename(Accomplishment = ML1,
         Pressure = ML2,
         EXhaustion = ML3)
#merge the stuscore data set with USHSv2
USHSv2  <-  left_join(USHSv2, stuScores)
ncol(USHSv2)#find out the column number
## [1] 111
USHSv2 %>% select(1,99:104, 109:111) %>% head #have a look at the categorical variables and 
##   fsd_id height weight gender   age      BMI    BMI.factor Accomplishment
## 1      1    180     80   Male  <NA> 24.69136 Normal weight     -1.1933486
## 2      2    172     55 Female 22-24 18.59113 Normal weight      0.7915160
## 3      3    186     94   Male 25-27 27.17077    Overweight     -0.6107434
## 4      4    183     85   Male 22-24 25.38147    Overweight             NA
## 5      5    185     66   Male 22-24 19.28415 Normal weight      0.3663595
## 6      6    164     51 Female  <NA> 18.96193 Normal weight     -1.1904713
##     Pressure EXhaustion
## 1 -1.5911969 -0.9734333
## 2  0.1850128 -0.2752137
## 3  0.0176132  0.3644921
## 4         NA         NA
## 5  0.5176065  1.4582816
## 6 -1.2177794  0.9315672
                                     #factor score variables (the end of the sheet)

1.6.2 inspection of factor scores

# inspect the normality of factor scores using histogram
USHSv2 %>% select (fsd_id, 99:104, 109:111) %>% 
  pivot_longer(c(8:10), names_to = "Factor", values_to = "FactorScore") %>% 
  ggplot(aes(x = FactorScore, fill = Factor)) +
  geom_histogram() +
  facet_wrap(~Factor) +
  theme_bw()+
  theme(legend.position = "none")

   At face value, the factor scores for accomplishment looked normally distributed, those for pressure roughly normally distributed (to be checked with a QQ plot or KS test), and those for exhaustion right-skewed. See the histograms above.

# inspect the normality of factor scores by gender using histogram
USHSv2 %>% select (fsd_id, 99:104, 109:111) %>% 
  pivot_longer(c(8:10), names_to = "Factor", values_to = "FactorScore") %>% 
  ggplot(aes(x = FactorScore, fill = Factor)) +
  geom_histogram() +
  facet_grid(Factor~gender)+
  theme_bw()+
  theme(legend.position = "none")

   At face value, when grouped by gender, the distributions of the factor scores stayed roughly the same as in the ungrouped data. See the histograms above.

# inspect the normality of factor scores by BMI.factor using histogram
USHSv2 %>% select (fsd_id, 99:104, 109:111) %>% 
  pivot_longer(c(8:10), names_to = "Factor", values_to = "FactorScore") %>% 
  ggplot(aes(x = FactorScore, fill = Factor)) +
  geom_histogram() +
  facet_grid(Factor~BMI.factor)+
  theme_bw()+
  theme(legend.position = "none")

   At face value, when grouped by BMI range, the distributions of the factor scores stayed roughly the same as in the ungrouped data. See the histograms above.

# inspect the normality of factor scores for Accomplishment using qqplot
fsp1 <- USHSv2 %>%  
  ggplot(aes(sample = Accomplishment)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue")+
  labs(title = "Accomplishment")+
  theme(plot.title = element_text(hjust = 0.5))

# inspect the normality of factor scores for Pressure using qqplot
fsp2 <- USHSv2 %>%  
  ggplot(aes(sample = Pressure)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue")+
  labs(title = "Pressure")+
  theme(plot.title = element_text(hjust = 0.5))

# inspect the normality of factor scores for Exhaustion  using qqplot
fsp3 <- USHSv2 %>%  
  ggplot(aes(sample = EXhaustion)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue")+
  labs(title = "Exhaustion")+
  theme(plot.title = element_text(hjust = 0.5))

fsp1+fsp2+fsp3

   The QQ plots echoed the conclusions from the histograms: accomplishment and pressure were roughly normally distributed, although their extreme values deviated somewhat from the expected values, while exhaustion was considerably skewed.

#KS test for normality
USHSv2 %>% 
  select(109:111) %>% 
  apply(., 2, function(x)ks.test(x, "pnorm"))
## $Accomplishment
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.026517, p-value = 0.03785
## alternative hypothesis: two-sided
## 
## 
## $Pressure
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.042708, p-value = 6.79e-05
## alternative hypothesis: two-sided
## 
## 
## $EXhaustion
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.11081, p-value < 2.2e-16
## alternative hypothesis: two-sided

   The KS test for normality was adopted considering the fairly large sample size. At the 5% level the test rejected normality for all three scores, although only marginally for "sense of accomplishment" (p = 0.038); the assumption of normality was clearly violated for "pressure" and "exhaustion".

   In summary, the histograms and QQ plots agreed with the KS test for exhaustion (clearly non-normal), while for accomplishment and pressure the graphical impression of approximate normality conflicted with the significant KS results (only marginally so for accomplishment). Given the large sample and the marginal deviation, I still analysed accomplishment with a parametric test, but to reduce the risk of type I errors in the following hypothesis testing, I made the conservative decision to treat pressure as not normally distributed and used a non-parametric test for it.
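One caveat worth noting: `ks.test(x, "pnorm")` tests against a standard normal with *known* mean 0 and sd 1, while the regression factor scores are only approximately standardized, so a test that estimates the parameters from the data is arguably more appropriate. A sketch of such a cross-check (Shapiro-Wilk accepts samples up to 5000 observations):

```r
# Shapiro-Wilk as a cross-check; unlike ks.test(x, "pnorm") it does not
# assume the mean and sd of the reference normal are known in advance
shapiro.test(na.omit(USHSv2$Accomplishment))
shapiro.test(na.omit(USHSv2$Pressure))
shapiro.test(na.omit(USHSv2$EXhaustion))
```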

# Correlation among the factor scores of 3 dimensions. 
fsp4 <- USHSv2 %>% 
  ggplot(aes(x = Accomplishment, y = Pressure))+
  geom_point(shape = 1, colour = "blue", size =0.5)

fsp5 <- USHSv2 %>% 
  ggplot(aes(x = Accomplishment, y = EXhaustion))+
  geom_point(shape = 1, colour = "blue", size =0.5)

fsp6 <- USHSv2 %>% 
  ggplot(aes(x = Pressure, y = EXhaustion))+
  geom_point(shape = 1, colour = "blue", size =0.5)

fsp4+fsp5+fsp6

  From the scatter plots above, there appeared to be little or no correlation between pressure and accomplishment, a moderate negative correlation between exhaustion and accomplishment, and a moderate positive correlation between exhaustion and pressure. This made sense: exhaustion plausibly deters achievement (negative relation), and increasing pressure aggravates exhaustion (positive relation), while pressure can be either an impetus or a barrier to accomplishment depending on the situation and the individual (so the correlation cancels out). This further supported the adequacy of our factor structure.
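The visual reading of the scatter plots can be quantified with pairwise correlation coefficients (a sketch; pairwise complete observations handle the missing factor scores):

```r
# Numeric counterpart to the scatter plots: correlations among the
# three factor score variables, using complete pairs only
USHSv2 %>%
  select(Accomplishment, Pressure, EXhaustion) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)
```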

1.6.3 Hypothesis testing

  Previous studies have found that obesity and burnout are correlated in the general population, and BMI is a common gauge of obesity. It is therefore interesting to explore their relationship in a sample of Finnish students. Hence I tested the following hypotheses:

  a. Students in different BMI ranges have a different sense of exhaustion.
  b. Students in different BMI ranges have a different sense of pressure.
  c. Students in different BMI ranges have a different sense of accomplishment.

# Test if student with different BMI ranges have different exhaustion scores.
fat0 <- USHSv2 %>%  
  summary_factorlist(dependent = "EXhaustion", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(fat0, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Median (IQR) -0.2 (-0.7 to 0.5) 0.130
Normal weight Median (IQR) -0.3 (-0.8 to 0.6)
Overweight Median (IQR) -0.3 (-0.8 to 0.7)
Obese Median (IQR) -0.1 (-0.8 to 0.9)

  We did not have sufficient evidence (p = 0.13) to reject the null hypothesis that students in different BMI ranges have the same sense of exhaustion from study.
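With `cont_nonpara = 1`, `summary_factorlist()` should be reporting a Kruskal-Wallis p-value; a direct test makes this explicit (a sketch; I assume here that finalfit calls `kruskal.test()` internally for non-parametric continuous variables):

```r
# Direct Kruskal-Wallis test of exhaustion scores across BMI ranges;
# the p-value should match the one in the table above
kruskal.test(EXhaustion ~ BMI.factor, data = USHSv2)
```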

  However, since gender can confound the relationship between BMI and sense of exhaustion, I also tested it separately for each gender:

# Test if male student with different BMI ranges have different exhaustion scores.
fat1 <- USHSv2 %>% filter (gender == "Male") %>% 
  summary_factorlist(dependent = "EXhaustion", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(fat1, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Median (IQR) 0.3 (-0.3 to 1.3) 0.125
Normal weight Median (IQR) -0.3 (-0.8 to 0.6)
Overweight Median (IQR) -0.2 (-0.8 to 0.8)
Obese Median (IQR) 0.1 (-0.6 to 0.9)
# Test if female student with different BMI ranges have different exhaustion scores.
fat2 <- USHSv2 %>% filter (gender == "Female") %>% 
  summary_factorlist(dependent = "EXhaustion", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(fat2, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Median (IQR) -0.3 (-0.8 to 0.3) 0.452
Normal weight Median (IQR) -0.3 (-0.9 to 0.6)
Overweight Median (IQR) -0.3 (-0.9 to 0.7)
Obese Median (IQR) -0.1 (-0.8 to 0.9)

  We still did not have sufficient evidence to reject the null hypothesis of no difference in exhaustion across BMI ranges among male or female students (p = 0.13 and 0.45, respectively).

# Test if student with different BMI ranges have different accomplishment scores.
fat3 <- USHSv2 %>%  
  summary_factorlist(dependent = "Accomplishment", 
                     explanatory = "BMI.factor", 
         
                     p = T)
  knitr::kable(fat3, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Mean (sd) 0.1 (1.0) 0.146
Normal weight Mean (sd) 0.0 (1.0)
Overweight Mean (sd) -0.1 (1.0)
Obese Mean (sd) -0.1 (1.0)

  We did not have sufficient evidence (p = 0.15) to reject the null hypothesis that students in different BMI ranges have the same sense of accomplishment from study.

  However, since gender can confound the relationship between the feeling of accomplishment and BMI range, I also tested it separately for each gender:

# Test if male student with different BMI ranges have different accomplishment scores.
fat4 <- USHSv2 %>% filter (gender == "Male") %>% 
  summary_factorlist(dependent = "Accomplishment", 
                     explanatory = "BMI.factor", 
                      
                     p = T)
  knitr::kable(fat4, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Mean (sd) -0.3 (0.9) 0.475
Normal weight Mean (sd) -0.1 (1.0)
Overweight Mean (sd) -0.1 (1.0)
Obese Mean (sd) -0.3 (1.0)
# Test if female student with different BMI ranges have different accomplishment scores.
fat5 <- USHSv2 %>% filter (gender == "Female") %>% 
  summary_factorlist(dependent = "Accomplishment", 
                     explanatory = "BMI.factor", 
                     
                     p = T)
  knitr::kable(fat5, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Mean (sd) 0.2 (1.0) 0.231
Normal weight Mean (sd) 0.1 (1.0)
Overweight Mean (sd) -0.0 (1.0)
Obese Mean (sd) 0.0 (1.0)

  We still did not have sufficient evidence to reject the null hypothesis of no difference in accomplishment across BMI ranges among male or female students (p = 0.48 and 0.23, respectively).

# Test if students with different BMI ranges have different pressure scores.
fat6 <- USHSv2 %>%  
  summary_factorlist(dependent = "Pressure", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(fat6, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Median (IQR) 0.2 (-0.5 to 0.9) 0.024
Normal weight Median (IQR) -0.1 (-0.8 to 0.6)
Overweight Median (IQR) -0.0 (-0.8 to 0.8)
Obese Median (IQR) 0.1 (-0.7 to 1.0)

  We found sufficient evidence (p = 0.024) to reject the null hypothesis that students in different BMI ranges have the same sense of pressure from study.

  However, since gender can confound the relationship between the feeling of pressure and BMI range, I also tested it for each gender to see whether the difference holds for both males and females:

# Test if male students with different BMI ranges have different pressure scores.
fat7 <- USHSv2 %>% filter (gender == "Male") %>% 
  summary_factorlist(dependent = "Pressure", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(fat7, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Median (IQR) 0.1 (-0.2 to 0.3) 0.323
Normal weight Median (IQR) -0.4 (-1.0 to 0.3)
Overweight Median (IQR) -0.3 (-1.0 to 0.6)
Obese Median (IQR) -0.1 (-0.7 to 0.3)
# Test if female students with different BMI ranges have different pressure scores.
fat8 <- USHSv2 %>% filter (gender == "Female") %>% 
  summary_factorlist(dependent = "Pressure", 
                     explanatory = "BMI.factor", 
                     cont_nonpara = 1, 
                     p = T)
  knitr::kable(fat8, align =c("l", "l", "r", "r", "r"))
label levels unit value p
BMI ranges Underweight Median (IQR) 0.3 (-0.5 to 1.2) 0.026
Normal weight Median (IQR) 0.0 (-0.7 to 0.7)
Overweight Median (IQR) 0.1 (-0.5 to 1.0)
Obese Median (IQR) 0.3 (-0.7 to 1.2)

  Interestingly, female students in different BMI ranges differed in their sense of pressure, while no similar effect was found among male students (p = 0.026 and 0.32, respectively).

#multiple comparison
USHSv2_female <- USHSv2 %>% filter (gender == "Female") 
  pairwise.t.test(USHSv2_female$Pressure, USHSv2_female$BMI.factor, 
                p.adjust.method = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2_female$Pressure and USHSv2_female$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 0.222       -             -         
## Overweight    0.876       0.019         -         
## Obese         0.697       0.012         0.408     
## 
## P value adjustment method: none
# multiple comparison with corrected p value from bonferroni method
USHSv2_female <- USHSv2 %>% filter (gender == "Female") 
  pairwise.t.test(USHSv2_female$Pressure, USHSv2_female$BMI.factor, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  USHSv2_female$Pressure and USHSv2_female$BMI.factor 
## 
##               Underweight Normal weight Overweight
## Normal weight 1.000       -             -         
## Overweight    1.000       0.111         -         
## Obese         1.000       0.069         1.000     
## 
## P value adjustment method: bonferroni

  Multiple comparisons were done to find out which BMI range(s) of female students differed in their sense of pressure. Without correction, normal-weight female students differed from obese students (p = 0.012) and from overweight students (p = 0.019). However, after correcting the p values with the Bonferroni method, none of the differences remained significant.
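Since pressure was earlier treated as non-normally distributed, pairwise Wilcoxon rank-sum tests are a more consistent follow-up than the pooled-SD t tests above (a sketch using the same female subset):

```r
# Non-parametric pairwise comparisons with Bonferroni correction,
# matching the non-parametric choice made for Pressure earlier
pairwise.wilcox.test(USHSv2_female$Pressure, USHSv2_female$BMI.factor,
                     p.adjust.method = "bonferroni")
```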

  In summary, we did not have sufficient evidence to conclude that students in different BMI ranges differ in their sense of exhaustion, sense of pressure or sense of accomplishment; the one initially significant result (pressure among females) did not survive correction for multiple comparisons.


GOOOOOD JOB!!

In the end, you should be ready to KNIT this document and SUBMIT the result (.html file) as your report of the Assignment 2 in Moodle. I am completely sure that you learned again a lot! GRRRRREAT!!!